Unit 1 Module 4
Spatial Data Analysis with R (01:450:320)
1 Data analysis and visualization
We are now moving on to the final module of this unit, in which we'll learn
about manipulating, analyzing, and visualizing data. From here on we'll make
much more use of tidyverse tools.
To access the example datasets and files for this course, you must
first install the auxiliary R package sdadata using the
following command. Since the repository is public, you do not need to
configure a Personal Access Token (PAT) for access.
For this module we will be using the following packages:
We will start making extensive use of %>% (the pipe
operator). You will save yourself much time by using the keyboard
shortcut for inserting that operator, which is
CMD + shift + m on a Mac, ctrl + shift + m on
Windows. You will wear your fingers out typing the three characters
each time (and you also won't get the automatic space on either side).
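To see what the pipe actually does, compare a nested call with its piped equivalent. This is a minimal sketch using base R's native |> pipe (available in R >= 4.1), which behaves like %>% for simple calls such as these:

```r
# x %>% f() is (roughly) equivalent to f(x); pipes let you read a
# sequence of operations left-to-right instead of inside-out.
vals <- c(1, 4, 9)

# Nested call, read inside-out:
sum(sqrt(vals))         # 6

# Piped version, read left-to-right:
vals |> sqrt() |> sum() # 6
```

Both forms compute the same thing; the piped form simply makes the order of operations explicit.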
2 Data preparation
The first thing we need is some data to work with. R comes with many
built-in datasets, but when you are out in the wild and working on your
own with R, you are generally going to get your datasets
from somewhere else. That means that you will need to read them into
R. Once you have read them in, you will often need to clean
them up and rearrange them before they are ready to analyze. This
section focuses on reading in and organizing your data in
R. We are going to work with files in the commonly used csv
data format.
2.1 Reading in and writing out data
We are going to work with three csv files that I downloaded from FAOStat, one of the
primary sources for agricultural census data. The data
represent the harvested areas and production quantities of maize, wheat, and
sorghum for the years 1961-2024 for Tanzania and Zambia. They are
bundled with sdadata, so you can access them by
locating them with system.file.
We can easily read in each file using base R’s
read.csv function, which is widely used. However, we are
going to skip straight to a more powerful csv reader provided by the
tidyverse’s readr package. Even
more powerful is data.table::fread, which is excellent (and
as far as I know, still the fastest on the market) for reading in very
large csv files, but we are sticking with the tidyverse. First, for
grins, here is how you would use read.csv:
f <- system.file("extdata/FAOSTAT_maize.csv", package = "sdadata")
maize <- read.csv(f)
head(maize)
#> Domain.Code Domain Area.Code..FAO. Area Element.Code
#> 1 QCL Crops and livestock products 215 United Republic of Tanzania 5312
#> 2 QCL Crops and livestock products 215 United Republic of Tanzania 5510
#> 3 QCL Crops and livestock products 215 United Republic of Tanzania 5312
#> 4 QCL Crops and livestock products 215 United Republic of Tanzania 5510
#> 5 QCL Crops and livestock products 215 United Republic of Tanzania 5312
#> 6 QCL Crops and livestock products 215 United Republic of Tanzania 5510
#> Element Item.Code..CPC. Item Year.Code Year Unit Value Flag Flag.Description Note
#> 1 Area harvested 112 Maize (corn) 1961 1961 ha 790000 E Estimated value
#> 2 Production 112 Maize (corn) 1961 1961 t 590000 E Estimated value
#> 3 Area harvested 112 Maize (corn) 1962 1962 ha 800000 E Estimated value
#> 4 Production 112 Maize (corn) 1962 1962 t 600000 E Estimated value
#> 5 Area harvested 112 Maize (corn) 1963 1963 ha 960000 E Estimated value
#> 6 Production 112 Maize (corn) 1963 1963 t 850000 E Estimated value
Now the readr way:
maize <- readr::read_csv(f)
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maize
#> # A tibble: 256 × 15
#> Domain.Code Domain Area.Code..FAO. Area Element.Code Element Item.Code..CPC. Item Year.Code
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 QCL Crops and… 215 Unit… 5312 Area h… 112 Maiz… 1961
#> 2 QCL Crops and… 215 Unit… 5510 Produc… 112 Maiz… 1961
#> 3 QCL Crops and… 215 Unit… 5312 Area h… 112 Maiz… 1962
#> 4 QCL Crops and… 215 Unit… 5510 Produc… 112 Maiz… 1962
#> 5 QCL Crops and… 215 Unit… 5312 Area h… 112 Maiz… 1963
#> 6 QCL Crops and… 215 Unit… 5510 Produc… 112 Maiz… 1963
#> 7 QCL Crops and… 215 Unit… 5312 Area h… 112 Maiz… 1964
#> 8 QCL Crops and… 215 Unit… 5510 Produc… 112 Maiz… 1964
#> 9 QCL Crops and… 215 Unit… 5312 Area h… 112 Maiz… 1965
#> 10 QCL Crops and… 215 Unit… 5510 Produc… 112 Maiz… 1965
#> # ℹ 246 more rows
#> # ℹ 6 more variables: Year <dbl>, Unit <chr>, Value <dbl>, Flag <chr>, Flag.Description <chr>,
#> # Note <chr>
That reads it in as a tibble, rather than a data.frame,
but remember that a tibble is an enhanced
data.frame.
Right away we can see that the data look kind of messy. Nothing in
that preview tells us much about the data, although
readr at least gives a summary of the data columns and their
types on read-in. So let's inspect what's there:
library(dplyr)
maize %>% slice(1) # same as maize[1, ]
#> # A tibble: 1 × 15
#> Domain.Code Domain Area.Code..FAO. Area Element.Code Element Item.Code..CPC. Item Year.Code
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 QCL Crops and … 215 Unit… 5312 Area h… 112 Maiz… 1961
#> # ℹ 6 more variables: Year <dbl>, Unit <chr>, Value <dbl>, Flag <chr>, Flag.Description <chr>,
#> # Note <chr>
That's the first row of data from maize, using
dplyr::slice instead of maize[1, ] to get it.
We don't need everything in there; we are actually interested in just a
few columns. We'll get to how we select those data in the next
section.
First, let’s read in all three datasets:
# Chunk 1
# #1
fs <- dir(system.file("extdata/", package = "sdadata"),
pattern = "FAOSTAT", full.names = TRUE)
fs
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_maize.csv"
#> [2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_sorghum.csv"
#> [3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_wheat.csv"
# #2
crops <- lapply(fs, readr::read_csv)
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We used good old lapply to read all three files into a
list (#2), after using the dir function with
system.file to retrieve the paths to the three csvs (#1).
Let's break that down:
# Chunk 2
# #1
system.file("extdata", package = "sdadata")
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata"
#
# #2
dir(system.file("extdata", package = "sdadata"))
#> [1] "FAOSTAT_maize.csv" "FAOSTAT_sorghum.csv" "FAOSTAT_wheat.csv"
#
# #3
dir(system.file("extdata", package = "sdadata"), pattern = "FAOSTAT")
#> [1] "FAOSTAT_maize.csv" "FAOSTAT_sorghum.csv" "FAOSTAT_wheat.csv"
#
# #4
dir(system.file("extdata", package = "sdadata"), pattern = "FAOSTAT",
full.names = TRUE)
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_maize.csv"
#> [2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_sorghum.csv"
#> [3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_wheat.csv"
In #1, we get the file path to the extdata folder of the
installed sdadata package. #2 shows us the names of
all the files in there. #3 narrows the listed files down to
those whose names contain "FAOSTAT", and #4 returns the full file
paths for each.
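The same dir() pattern can be sketched with files we create ourselves in a temporary directory, so it runs on any machine (the file names here are made up for the demonstration):

```r
# Create a demo directory containing a couple of dummy files:
d <- file.path(tempdir(), "dir_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("FAOSTAT_a.csv", "FAOSTAT_b.csv", "notes.txt")))

dir(d)                                         # all three file names
dir(d, pattern = "FAOSTAT")                    # only the two matching csvs
dir(d, pattern = "FAOSTAT", full.names = TRUE) # the same, with full paths
```

The pattern argument is a regular expression, so any file whose name contains the string "FAOSTAT" is matched.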
So that’s a quick introduction to how one can construct a file path and read in table data stored in a csv.
Let’s say we want to write out some data:
# Chunk 3
# #1
set.seed(1)
dat <- tibble(a = sample(1:10), b = rnorm(10))
# #2
td <- tempdir()
# #3
readr::write_csv(dat, file = file.path(td, "dummy.csv"))
# #4
readr::read_csv(file.path(td, "dummy.csv"))
#> Rows: 10 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 10 × 2
#> a b
#> <dbl> <dbl>
#> 1 9 -0.820
#> 2 4 0.487
#> 3 7 0.738
#> 4 1 0.576
#> 5 2 -0.305
#> 6 5 1.51
#> 7 3 0.390
#> 8 10 -0.621
#> 9 6 -2.21
#> 10 8 1.12
In #1 we do the usual creation of dummy data, but we replace
data.frame with tibble, the enhanced
data.frame (we could have used data.frame just as well).
#2 creates a temporary directory, which allows this demonstration to
run on your computer (i.e. you won't have to create a directory with a
name and path that matches one on my computer for the code to execute). #3
uses readr::write_csv to write the data to disk, and we read it
back in in #4.
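For comparison, here is the same round trip with base R's write.csv and read.csv (a sketch with its own toy data; note that write.csv writes an extra row-names column unless told otherwise):

```r
# Dummy data, written out and read back with base R functions:
dat2 <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
f <- file.path(tempdir(), "dummy_base.csv")

# row.names = FALSE prevents an unnamed first column in the csv:
write.csv(dat2, file = f, row.names = FALSE)

back <- read.csv(f)
identical(dat2$a, back$a)  # TRUE
```

readr's functions never write row names, which is one small reason to prefer them for csv work.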
2.2 Getting your data clean and in shape
As we have already seen, our three crop datasets are messy. There are columns we don’t need, and we are not sure if we want the row structure as it is. So we have to prepare our data.
2.2.1 Tidy data
This introduces the concept of tidy data, which is the foundational concept for the tidyverse. Hadley Wickham wrote a whole paper on the tidy data concept. To quote the key principles:
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
…Messy data is any other arrangement of the data.
Please do read that paper to get a full understanding of it, as we are going to be reorganizing our data according to these principles.
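The principles are easiest to see side by side. In this small illustration with hypothetical numbers, the messy version stores variable names in one column and their values in another, while the tidy version gives each variable its own column and each observation (a year) its own row:

```r
# Messy: the element column holds variable *names*, and value holds
# their values, so a single observation is spread over two rows:
messy <- data.frame(
  year    = c(1961, 1961, 1962, 1962),
  element = c("harv_area", "prod", "harv_area", "prod"),
  value   = c(790, 590, 800, 600)
)

# Tidy: each variable forms a column, each observation forms a row:
tidy <- data.frame(
  year      = c(1961, 1962),
  harv_area = c(790, 800),
  prod      = c(590, 600)
)
```

Notice that the two tables carry exactly the same information; only the arrangement differs, which is the whole point of the tidy data idea.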
2.2.2 Selecting
Let’s start by getting rid of some of the extraneous variables in our
data. We’ll start with just the maize dataset, which we
read in on its own above. Having already looked at it, we know there are
a bunch of columns we don’t need, so we will pull out the
essentials:
# Chunk 4
# dplyr selection
maize <- maize %>% dplyr::select(Item, Area, Element, Year, Value)
# base R (not run)
# maize <- maize[, c("Item", "Area", "Element", "Year", "Value")]
Which reduces maize down to the columns Item
(crop name), Area (country), Element
(the variable), Year, and Value (the value of the variable).
Note that we used dplyr::select (note the ::
specification here, due to a namespace conflict with
raster::select) to grab the columns we wanted, instead of
the base R way of selecting (shown commented out).
Year and Value store numeric values, but Item, Area, and Element are categorical (character). So let’s look at what values are stored in them:
# Chunk 5
maize %>% distinct(Item, Area, Element)
#> # A tibble: 4 × 3
#> Item Area Element
#> <chr> <chr> <chr>
#> 1 Maize (corn) United Republic of Tanzania Area harvested
#> 2 Maize (corn) United Republic of Tanzania Production
#> 3 Maize (corn) Zambia Area harvested
#> 4 Maize (corn) Zambia Production
# unique(maize[c("Item", "Area", "Element")])
We use the distinct function to select the unique
values contained in each of those columns. You could do this in base
R as well, as shown in the commented-out code.
This exercise tells us something: we have one violation of the tidy principles. Item and Area seem correct, as each stores a single variable (crop name and country name, respectively). Element, on the other hand, exhibits:
Multiple variables are stored in one column
which is one of the definitions of messy data.
2.2.3 Reshaping
So we need to reshape the dataset. How? Well, the values “Area harvested” and “Production” in Element are both stored in the neighboring Value column. We need to make two new columns out of values in Value, pulling out the ones corresponding to Element “Production” into one new column, and Element “Area harvested” into another. This is called spreading, or going from long (or tall) to wide:
# Chunk 6
maize <- maize %>% pivot_wider(names_from = Element, values_from = Value)
# maize <- maize %>% spread(key = Element, value = Value)
maize
#> # A tibble: 128 × 5
#> Item Area Year `Area harvested` Production
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000
#> # ℹ 118 more rows
We use tidyr::pivot_wider to do that, specifying the
Element column as the one that contains the names of the new
columns (“names_from”), and Value as the column from which we
take the values to fill under each column (“values_from”). Note that
this function replaces the commented-out one below it, which is the
original tidyr::spread function. pivot_wider
is considered more intuitive. The inverse procedure is known as
gathering, where columns are re-organized into
key-value pairs:
# Chunk 7
maize %>% pivot_longer(cols = c("Area harvested", "Production"),
names_to = "Element", values_to = "Value")
# maize %>% gather(key = Element, value = Value, `Area harvested`, Production)
#> # A tibble: 256 × 5
#> Item Area Year Element Value
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 Area harvested 790000
#> 2 Maize (corn) United Republic of Tanzania 1961 Production 590000
#> 3 Maize (corn) United Republic of Tanzania 1962 Area harvested 800000
#> 4 Maize (corn) United Republic of Tanzania 1962 Production 600000
#> 5 Maize (corn) United Republic of Tanzania 1963 Area harvested 960000
#> 6 Maize (corn) United Republic of Tanzania 1963 Production 850000
#> 7 Maize (corn) United Republic of Tanzania 1964 Area harvested 930000
#> 8 Maize (corn) United Republic of Tanzania 1964 Production 720000
#> 9 Maize (corn) United Republic of Tanzania 1965 Area harvested 950000
#> 10 Maize (corn) United Republic of Tanzania 1965 Production 751000
#> # ℹ 246 more rows
Here we use tidyr::pivot_longer (which replaced
tidyr::gather) to reshape maize back to its
original form. It takes three arguments: "cols", which designates the
columns you want to combine into long form; "names_to", which
specifies the name of the column that will hold the key variable, and
“values_to”, which specifies the name of the column that will hold the
values. The commented out line shows how to do the same thing with the
older gather function. Gathering, also known as going from
wide to long (or tall), is probably more common than spreading, but for
our datasets we have to spread, because the original form keeps two
clearly distinct variables in one column. So we will keep the reshaped
maize.
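For reference, base R's stats::reshape can perform the same pivoting; here is a sketch with toy numbers (the tidyr functions above remain the recommended route):

```r
long <- data.frame(
  year    = c(1961, 1961, 1962, 1962),
  element = c("harv", "prod", "harv", "prod"),
  value   = c(790, 590, 800, 600)
)

# long -> wide (the analogue of pivot_wider); the new columns are
# named value.harv and value.prod:
wide <- reshape(long, idvar = "year", timevar = "element",
                direction = "wide")

# ... and back again (the analogue of pivot_longer); reshape()
# remembers how the wide table was built:
reshape(wide)
```

Comparing this with Chunks 6 and 7 makes it clear why the tidyr interface is preferred: the arguments name what they do, and the output column names are what you asked for.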
2.2.4 Renaming
We still need to clean it up some more, though. Note in the
pivot_longer operation above how Area harvested
has backticks around it. That's because it has a space in the variable
name, which is bad practice. We should also remove the capital letters.
# Chunk 8
maize %>% rename(crop = Item, country = Area, year = Year,
harv_area = `Area harvested`, prod = Production)
#> # A tibble: 128 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000
#> # ℹ 118 more rows
So that's fairly straightforward. We use dplyr::rename
to give each column a new, more informative name. That's the most
direct way of renaming. There are more programmatic ways of doing it,
but we will circle back to those later on.
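For comparison, the base R way of renaming is to assign into names(); a sketch with a toy data.frame (check.names = FALSE keeps the space-containing name rather than converting it):

```r
# A toy data.frame with one well-formed and one badly formed name:
df <- data.frame(Item = "Maize", `Area harvested` = 790000,
                 check.names = FALSE)

# Rename by assigning into the names() vector:
names(df)[names(df) == "Item"] <- "crop"
names(df)[names(df) == "Area harvested"] <- "harv_area"
names(df)  # now "crop", "harv_area"
```

This works, but dplyr::rename is easier to read, especially when renaming several columns at once.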
2.2.5 Chaining/piping
Okay, so we have just seen a bunch of data tidying steps above. This
is a good point to talk about combining operations. Combining
operations is advantageous when we have to do several things to get a
desired outcome, but have no need to keep the results of intermediate
steps, as in this case: we have to make several fixes to these data,
but only care about their final tidied version. We can combine our
tidying operations so that they are executed all at once using
dplyr/tidyr pipes:
# Chunk 9
crops[[1]] %>% dplyr::select(Item, Area, Element, Year, Value) %>%
pivot_wider(names_from = Element, values_from = Value) %>%
rename(crop = Item, country = Area, year = Year,
harv_area = `Area harvested`, prod = Production)
#> # A tibble: 128 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000
#> # ℹ 118 more rows
We grab the first element of crops, the maize
tibble, and apply all three operations sequentially by
chaining them together with %>%, the pipe operator. This
means we are piping the result of each operation to the next: the
result of select is piped to pivot_wider,
and the result of that to rename.
Important note: chaining/piping is not unique to data tidying; it applies to operations throughout the tidyverse.
Now that we have our pipes set up, we can apply them to all three
datasets in an lapply:
# Chunk 10
lapply(crops, function(x) {
x %>% dplyr::select(Item, Area, Element, Year, Value) %>%
pivot_wider(names_from = Element, values_from = Value) %>%
rename(crop = Item, country = Area, year = Year,
harv_area = `Area harvested`, prod = Production)
})
#> [[1]]
#> # A tibble: 128 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000
#> # ℹ 118 more rows
#>
#> [[2]]
#> # A tibble: 128 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sorghum United Republic of Tanzania 1961 200000 180000
#> 2 Sorghum United Republic of Tanzania 1962 175000 175000
#> 3 Sorghum United Republic of Tanzania 1963 162000 178000
#> 4 Sorghum United Republic of Tanzania 1964 173000 135258
#> 5 Sorghum United Republic of Tanzania 1965 192000 149124
#> 6 Sorghum United Republic of Tanzania 1966 310000 161072
#> 7 Sorghum United Republic of Tanzania 1967 310000 143627
#> 8 Sorghum United Republic of Tanzania 1968 315000 163314
#> 9 Sorghum United Republic of Tanzania 1969 310000 145179
#> 10 Sorghum United Republic of Tanzania 1970 310000 171912
#> # ℹ 118 more rows
#>
#> [[3]]
#> # A tibble: 128 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Wheat United Republic of Tanzania 1961 8000 6100
#> 2 Wheat United Republic of Tanzania 1962 18000 17578
#> 3 Wheat United Republic of Tanzania 1963 22000 24514
#> 4 Wheat United Republic of Tanzania 1964 25000 27099
#> 5 Wheat United Republic of Tanzania 1965 33000 33019
#> 6 Wheat United Republic of Tanzania 1966 34000 38270
#> 7 Wheat United Republic of Tanzania 1967 31000 35099
#> 8 Wheat United Republic of Tanzania 1968 38000 44000
#> 9 Wheat United Republic of Tanzania 1969 38000 41000
#> 10 Wheat United Republic of Tanzania 1970 60000 57000
#> # ℹ 118 more rows
The tidying is now done, although we haven't saved the results to a new object yet. We'll do that next.
2.2.6 Combining datasets (and operations)
Oftentimes when one is working with data, there is a need to combine
several datasets into one. These three datasets are one such example–we
don’t really need to keep our three tibbles as separate
elements in a list, as it might be easier to perform any further
manipulations of the data if we combine them all into one big
tibble. We’ll look at two ways of joining tables.
2.2.6.1 Binding
Now we’ll bind all three tibbles into a single large
one:
# Chunk 11
crops_df <- do.call(rbind, lapply(crops, function(x) {
x %>% dplyr::select(Item, Area, Element, Year, Value) %>%
pivot_wider(names_from = Element, values_from = Value) %>%
rename(crop = Item, country = Area, year = Year,
harv_area = `Area harvested`, prod = Production)
}))
crops_df
#> # A tibble: 384 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000
#> # ℹ 374 more rows
Building on Chunk 10, we apply the tidying procedure and then join
the three tidied tibbles into one long one
(crops_df), using do.call and
rbind for that last step. do.call basically
says "do this function call": here the function is rbind,
and it is applied to the results of the
lapply. Or, broken down:
# Chunk 12
crops2 <- lapply(crops, function(x) {
x %>% dplyr::select(Item, Area, Element, Year, Value) %>%
pivot_wider(names_from = Element, values_from = Value) %>%
rename(crop = Item, country = Area, year = Year,
harv_area = `Area harvested`, prod = Production)
})
crops_df <- do.call(rbind, crops2)
# crops_df <- rbind(crops2[[1]], crops2[[2]], crops2[[3]])
set.seed(1)
crops_df %>% sample_n(., size = 20)
#> # A tibble: 20 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Wheat Zambia 1964 145 214
#> 2 Sorghum United Republic of Tanzania 1999 659868 561031
#> 3 Sorghum United Republic of Tanzania 1961 200000 180000
#> 4 Wheat United Republic of Tanzania 2003 26890 74000
#> 5 Wheat United Republic of Tanzania 1974 46000 82000
#> 6 Sorghum United Republic of Tanzania 2019 646868 731877
#> 7 Wheat United Republic of Tanzania 2011 108287 112658
#> 8 Maize (corn) Zambia 1981 493783 1007280
#> 9 Wheat United Republic of Tanzania 1981 51000 95000
#> 10 Wheat Zambia 2002 12000 75000
#> 11 Wheat Zambia 1970 100 101
#> 12 Wheat United Republic of Tanzania 1967 31000 35099
#> 13 Wheat Zambia 1969 100 220
#> 14 Maize (corn) Zambia 1975 1020000 1483133
#> 15 Sorghum Zambia 1981 21500 13950
#> 16 Maize (corn) United Republic of Tanzania 1997 1564000 1831200
#> 17 Maize (corn) Zambia 2001 582000 802000
#> 18 Sorghum Zambia 1985 24811 20227
#> 19 Wheat Zambia 2006 17100 93958
#> 20 Sorghum United Republic of Tanzania 1997 622400 538200
The above shows the lapply and the
do.call(rbind, ...) steps separately. Note the commented-out
line, which shows an alternate, less programmatic way of binding the
three tables. The final line uses dplyr::sample_n to select
20 rows at random, which reveals that the new large tibble
contains observations of all three crop types.
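The do.call(rbind, ...) pattern is easiest to see in isolation; here is a sketch with toy data.frames:

```r
lst <- list(data.frame(x = 1:2), data.frame(x = 3:4), data.frame(x = 5))

# do.call(f, args) calls f with the list elements as its arguments,
# so this is the same as rbind(lst[[1]], lst[[2]], lst[[3]]):
do.call(rbind, lst)  # a single 5-row data.frame
```

The advantage over writing rbind(lst[[1]], lst[[2]], lst[[3]]) by hand is that the same line works no matter how many elements the list has.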
Another note: there is actually a pure tidyverse way of doing the
full set of steps, using purrr’s set of map*
functions to replace the lapply, but we will leave that
aside for now.
2.2.6.2 Merging/joining
So we just saw how to combine datasets by binding them by rows.
Oftentimes you want to combine them by adding a variable or variables
from one dataset into another, using one or more common variables as the
key(s) for joining. In base R, this is done with the
merge function, but with dplyr we use the
*_join functions.
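To preview the base R side of this, here is a sketch of merge() on toy tables; its all.x/all.y arguments control which rows are kept, mirroring the *_join family:

```r
a <- data.frame(id = c("N1", "N2", "N3"), v1 = 1:3)
b <- data.frame(id = c("N2", "N3", "N4"), v2 = 4:6)

merge(a, b, by = "id")                # inner join: N2, N3 only
merge(a, b, by = "id", all.x = TRUE)  # left join: all of a, NA v2 for N1
merge(a, b, by = "id", all.y = TRUE)  # right join: all of b, NA v1 for N4
merge(a, b, by = "id", all = TRUE)    # full join: N1 through N4
```

Keep this correspondence in mind as you work through the *_join examples below.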
We'll start with some dummy datasets to illustrate, and then use our own crop data to show a more complicated join using multiple variables.
# Chunk 13a
set.seed(1)
t1 <- tibble(v1 = paste0("N", 1:5), v2 = rnorm(5))
t1
#> # A tibble: 5 × 2
#> v1 v2
#> <chr> <dbl>
#> 1 N1 -0.626
#> 2 N2 0.184
#> 3 N3 -0.836
#> 4 N4 1.60
#> 5 N5 0.330
t2 <- tibble(v1 = paste0("N", 1:5), v3 = runif(5))
t2
#> # A tibble: 5 × 2
#> v1 v3
#> <chr> <dbl>
#> 1 N1 0.206
#> 2 N2 0.177
#> 3 N3 0.687
#> 4 N4 0.384
#> 5 N5 0.770
t3 <- tibble(v1 = paste0("N", 1:7), v4 = sample(1:100, 7))
# v5 = letters[sample(1:26, 7)])
t3
#> # A tibble: 7 × 2
#> v1 v4
#> <chr> <int>
#> 1 N1 54
#> 2 N2 74
#> 3 N3 7
#> 4 N4 73
#> 5 N5 79
#> 6 N6 85
#> 7 N7 37
t4 <- tibble(v1 = paste0("N", c(1:2, 4:7, 11)),
v5 = letters[sample(1:26, 7)])
t4
#> # A tibble: 7 × 2
#> v1 v5
#> <chr> <chr>
#> 1 N1 i
#> 2 N2 y
#> 3 N4 n
#> 4 N5 e
#> 5 N6 w
#> 6 N7 b
#> 7 N11 j
We make four tibbles, each having a common variable
v1 plus one unique variable (v2, v3,
v4, or v5, respectively).
There are several different ways to do joins, and a matching function
for each. Please read about those functions
(?dplyr::join).
The simplest join case is between t1 and
t2, which can be done as:
# Chunk 13b
# #1
left_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v3
#> <chr> <dbl> <dbl>
#> 1 N1 -0.626 0.206
#> 2 N2 0.184 0.177
#> 3 N3 -0.836 0.687
#> 4 N4 1.60 0.384
#> 5 N5 0.330 0.770
#
# #2
inner_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v3
#> <chr> <dbl> <dbl>
#> 1 N1 -0.626 0.206
#> 2 N2 0.184 0.177
#> 3 N3 -0.836 0.687
#> 4 N4 1.60 0.384
#> 5 N5 0.330 0.770
#
# #3
right_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v3
#> <chr> <dbl> <dbl>
#> 1 N1 -0.626 0.206
#> 2 N2 0.184 0.177
#> 3 N3 -0.836 0.687
#> 4 N4 1.60 0.384
#> 5 N5 0.330 0.770
#
# #4
full_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v3
#> <chr> <dbl> <dbl>
#> 1 N1 -0.626 0.206
#> 2 N2 0.184 0.177
#> 3 N3 -0.836 0.687
#> 4 N4 1.60 0.384
#> 5 N5 0.330 0.770
We use inner_join, left_join,
right_join, or full_join to merge together
t1 and t2, getting identical results because
the two tibbles have the same number of rows and a unique key value
per row in the join variable (v1). The results begin to diverge
when we have differing numbers of rows:
# Chunk 14
# #1
inner_join(x = t1, y = t3, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v4
#> <chr> <dbl> <int>
#> 1 N1 -0.626 54
#> 2 N2 0.184 74
#> 3 N3 -0.836 7
#> 4 N4 1.60 73
#> 5 N5 0.330 79
#
# #2
left_join(x = t1, y = t3, by = "v1")
#> # A tibble: 5 × 3
#> v1 v2 v4
#> <chr> <dbl> <int>
#> 1 N1 -0.626 54
#> 2 N2 0.184 74
#> 3 N3 -0.836 7
#> 4 N4 1.60 73
#> 5 N5 0.330 79
#
# #3
right_join(x = t1, y = t3, by = "v1")
#> # A tibble: 7 × 3
#> v1 v2 v4
#> <chr> <dbl> <int>
#> 1 N1 -0.626 54
#> 2 N2 0.184 74
#> 3 N3 -0.836 7
#> 4 N4 1.60 73
#> 5 N5 0.330 79
#> 6 N6 NA 85
#> 7 N7 NA 37
#
# #4
full_join(x = t1, y = t3, by = "v1")
#> # A tibble: 7 × 3
#> v1 v2 v4
#> <chr> <dbl> <int>
#> 1 N1 -0.626 54
#> 2 N2 0.184 74
#> 3 N3 -0.836 7
#> 4 N4 1.60 73
#> 5 N5 0.330 79
#> 6 N6 NA 85
#> 7 N7 NA 37
In #1 and #2, t1 and t3 produce identical
results using either inner_join or left_join,
even though the two functions do slightly different things (note:
quotes are from older documentation):
?inner_join:…return all rows from x where there are matching values in y, and all columns from x and y.
?left_join:…return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
Since t1 is the x object, and t3
is the y, both join functions drop the two extra rows in
t3 because they have no matches in t1.
That contrasts with #3, where we use a right_join, which
returns (from ?right_join):
all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns.
So the unmatched rows from t3 (the last two) are kept in the
joined result, filled with NAs in the v2 column that came in
from t1. full_join produces the same result,
because it returns (from ?full_join):
return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
You get the same results in the last two cases because the values in
t1's v1 column are a subset of t3's
v1 values. Let's see what happens when the two
join columns each have values not shared by the other:
# Chunk 15a
# #1
inner_join(x = t3, y = t4, by = "v1")
#> # A tibble: 6 × 3
#> v1 v4 v5
#> <chr> <int> <chr>
#> 1 N1 54 i
#> 2 N2 74 y
#> 3 N4 73 n
#> 4 N5 79 e
#> 5 N6 85 w
#> 6 N7 37 b
#
# #2
left_join(x = t3, y = t4, by = "v1")
#> # A tibble: 7 × 3
#> v1 v4 v5
#> <chr> <int> <chr>
#> 1 N1 54 i
#> 2 N2 74 y
#> 3 N3 7 <NA>
#> 4 N4 73 n
#> 5 N5 79 e
#> 6 N6 85 w
#> 7 N7 37 b
#
# #3
right_join(x = t3, y = t4, by = "v1")
#> # A tibble: 7 × 3
#> v1 v4 v5
#> <chr> <int> <chr>
#> 1 N1 54 i
#> 2 N2 74 y
#> 3 N4 73 n
#> 4 N5 79 e
#> 5 N6 85 w
#> 6 N7 37 b
#> 7 N11 NA j
#
# #4
full_join(x = t3, y = t4, by = "v1")
#> # A tibble: 8 × 3
#> v1 v4 v5
#> <chr> <int> <chr>
#> 1 N1 54 i
#> 2 N2 74 y
#> 3 N3 7 <NA>
#> 4 N4 73 n
#> 5 N5 79 e
#> 6 N6 85 w
#> 7 N7 37 b
#> 8 N11 NA j
Each of the four joins produces a different output (it is left to you to describe why in the practice questions below).
Right, so that is an introduction to some different types of joins, and their varying behavior. We can join all four together using pipes:
# Chunk 15b
full_join(t1, t2) %>% full_join(., t3) %>% full_join(., t4)
#> Joining with `by = join_by(v1)`
#> Joining with `by = join_by(v1)`
#> Joining with `by = join_by(v1)`
#> # A tibble: 8 × 5
#> v1 v2 v3 v4 v5
#> <chr> <dbl> <dbl> <int> <chr>
#> 1 N1 -0.626 0.206 54 i
#> 2 N2 0.184 0.177 74 y
#> 3 N3 -0.836 0.687 7 <NA>
#> 4 N4 1.60 0.384 73 n
#> 5 N5 0.330 0.770 79 e
#> 6 N6 NA NA 85 w
#> 7 N7 NA NA 37 b
#> 8 N11 NA NA NA j
That is the most inclusive join of all four tibbles.
Notice that we dropped the argument names, and the join variable is
automatically found.
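If you prefer to silence the "Joining with `by = join_by(v1)`" messages, you can pass `by` explicitly at each step (the `.` placeholder is also optional, since the pipe supplies the first argument automatically). A minimal sketch with small stand-in tibbles (hypothetical data, not the course's t1-t4):

```r
library(dplyr)

# Hypothetical stand-ins, just to show the pattern
ta <- tibble(v1 = c("N1", "N2"), v2 = c(0.5, -0.2))
tb <- tibble(v1 = c("N1", "N2"), v3 = c(10L, 20L))
tc <- tibble(v1 = c("N1", "N3"), v4 = c("a", "b"))

# Explicit `by` at each step: no "Joining with ..." messages
joined <- ta %>%
  full_join(tb, by = "v1") %>%
  full_join(tc, by = "v1")
joined
```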
We are now going to end by doing a more complicated join based on three columns, using our crop data:
# Chunk 16
# #1
yields <- crops_df %>%
mutate(yield = prod / harv_area) %>%
dplyr::select(crop, country, year, yield)
yields %>% slice(1:2)
#> # A tibble: 2 × 4
#> crop country year yield
#> <chr> <chr> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 0.747
#> 2 Maize (corn) United Republic of Tanzania 1962 0.75
#
# #2
crop_ylds <- left_join(x = crops_df, y = yields,
by = c("crop", "country", "year"))
#
# #3
crop_ylds <- left_join(crops_df, yields)
#> Joining with `by = join_by(crop, country, year)`
crop_ylds
#> # A tibble: 384 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000 0.747
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000 0.75
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000 0.885
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000 0.774
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000 0.791
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000 0.8
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000 0.75
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000 0.543
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000 0.629
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000 0.481
#> # ℹ 374 more rows
First, we make a new tibble that calculates crop yield
from production and harvested area (yield = production / harvest area).
We use the mutate function (more on that in section 2.2.8)
to create the new variable, and select just crop,
country, year, and yield to make a new
yields tibble.
We then use a left_join to merge yields
onto crops_df. We specify three columns in this case,
because no single variable provides a unique key, nor does
any pair of variables. However, the combination of crop,
country, and year is unique for
each row, so we join on those three. #2 makes that explicit, but we see in #3
that left_join finds the join columns automatically if we
don’t specify them.
2.2.7 Arranging
That last line on Chunk #12
(crops_df %>% sample_n(., size = 20)) gives an
opportunity to introduce another aspect of data organizing, which is
sorting. Note that the years are out of order in that final random draw.
We can rearrange that so that they are sequential:
# Chunk 17
set.seed(1)
#
# #1
crops_df %>%
sample_n(., size = 20) %>%
arrange(year)
#> # A tibble: 20 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sorghum United Republic of Tanzania 1961 200000 180000
#> 2 Wheat Zambia 1964 145 214
#> 3 Wheat United Republic of Tanzania 1967 31000 35099
#> 4 Wheat Zambia 1969 100 220
#> 5 Wheat Zambia 1970 100 101
#> 6 Wheat United Republic of Tanzania 1974 46000 82000
#> 7 Maize (corn) Zambia 1975 1020000 1483133
#> 8 Maize (corn) Zambia 1981 493783 1007280
#> 9 Wheat United Republic of Tanzania 1981 51000 95000
#> 10 Sorghum Zambia 1981 21500 13950
#> 11 Sorghum Zambia 1985 24811 20227
#> 12 Maize (corn) United Republic of Tanzania 1997 1564000 1831200
#> 13 Sorghum United Republic of Tanzania 1997 622400 538200
#> 14 Sorghum United Republic of Tanzania 1999 659868 561031
#> 15 Maize (corn) Zambia 2001 582000 802000
#> 16 Wheat Zambia 2002 12000 75000
#> 17 Wheat United Republic of Tanzania 2003 26890 74000
#> 18 Wheat Zambia 2006 17100 93958
#> 19 Wheat United Republic of Tanzania 2011 108287 112658
#> 20 Sorghum United Republic of Tanzania 2019 646868 731877
#
# #2
crops_df %>%
sample_n(., size = 20) %>%
arrange(country, year)
#> # A tibble: 20 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1980 1400000 1726000
#> 2 Maize (corn) United Republic of Tanzania 1985 1576280 2093000
#> 3 Wheat United Republic of Tanzania 1993 43965 59000
#> 4 Wheat United Republic of Tanzania 1994 48800 59000
#> 5 Maize (corn) United Republic of Tanzania 2000 1017600 1965400
#> 6 Maize (corn) United Republic of Tanzania 2002 1718200 4408420
#> 7 Maize (corn) United Republic of Tanzania 2004 3173070 4651370
#> 8 Sorghum United Republic of Tanzania 2004 697220 648540
#> 9 Wheat Zambia 1966 41 81
#> 10 Maize (corn) Zambia 1966 820000 770000
#> 11 Sorghum Zambia 1966 72000 47000
#> 12 Wheat Zambia 1970 100 101
#> 13 Wheat Zambia 1980 2400 9765
#> 14 Wheat Zambia 1983 3044 10936
#> 15 Maize (corn) Zambia 1985 581846 1122351
#> 16 Wheat Zambia 2002 12000 75000
#> 17 Maize (corn) Zambia 2007 585291 1366158
#> 18 Sorghum Zambia 2016 26673 14106
#> 19 Maize (corn) Zambia 2017 1433944 3606549
#> 20 Wheat Zambia 2022 33568 234925.
#
# #3
crops_df %>%
sample_n(., size = 20) %>%
arrange(crop, country, year)
#> # A tibble: 20 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1974 1100000 761000
#> 2 Maize (corn) United Republic of Tanzania 1982 1231550 1654000
#> 3 Maize (corn) United Republic of Tanzania 1999 957550 2420940
#> 4 Maize (corn) United Republic of Tanzania 2005 3109590 3131610
#> 5 Maize (corn) Zambia 1999 597454 822056
#> 6 Maize (corn) Zambia 2000 586907 1040000
#> 7 Sorghum United Republic of Tanzania 1962 175000 175000
#> 8 Sorghum United Republic of Tanzania 1992 683071 587128
#> 9 Sorghum Zambia 1961 67000 40000
#> 10 Sorghum Zambia 1974 60000 23140
#> 11 Sorghum Zambia 1998 35864 25399
#> 12 Sorghum Zambia 2023 13749 6836.
#> 13 Wheat United Republic of Tanzania 1984 52010 83000
#> 14 Wheat United Republic of Tanzania 2002 30670 77000
#> 15 Wheat Zambia 1966 41 81
#> 16 Wheat Zambia 1971 100 100
#> 17 Wheat Zambia 2010 27291 172256
#> 18 Wheat Zambia 2011 37631 237332
#> 19 Wheat Zambia 2012 37209 253522
#> 20 Wheat Zambia 2018 21675 114463.
#
# #4
crops_df %>%
sample_n(., size = 20) %>%
arrange(crop, country, -year)
#> # A tibble: 20 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 2005 3109590 3131610
#> 2 Maize (corn) United Republic of Tanzania 1993 1824000 2282200
#> 3 Maize (corn) United Republic of Tanzania 1989 1980000 3128000
#> 4 Maize (corn) United Republic of Tanzania 1973 1000000 887000
#> 5 Maize (corn) Zambia 2006 618959 1424400
#> 6 Maize (corn) Zambia 2004 631000 1214000
#> 7 Maize (corn) Zambia 1998 510372 638134
#> 8 Maize (corn) Zambia 1980 555410 937266
#> 9 Sorghum United Republic of Tanzania 2008 566760 551270
#> 10 Sorghum United Republic of Tanzania 1981 700000 425000
#> 11 Sorghum United Republic of Tanzania 1973 353749 192406
#> 12 Sorghum Zambia 2020 35222 20011.
#> 13 Sorghum Zambia 1989 52008 33757
#> 14 Sorghum Zambia 1985 24811 20227
#> 15 Wheat United Republic of Tanzania 2008 43160 43360
#> 16 Wheat United Republic of Tanzania 2000 71700 32700
#> 17 Wheat United Republic of Tanzania 1991 50300 84000
#> 18 Wheat United Republic of Tanzania 1983 47620 74000
#> 19 Wheat Zambia 1999 9921 69226
#> 20 Wheat Zambia 1985 5000 17156
We draw a random sample of 20 rows in each example above, ordering
the resulting output differently in each case: #1 orders ascending by
year (oldest to most recent); #2 orders ascending first by country, then
by year; #3 ascending by crop, country, and year; #4 ascending by
crop and country, but descending by year (note the -
sign).
We can also sort our crop dataset:
# Chunk 18
# #1
crops_df %>% arrange(desc(year), country, desc(crop))
#> # A tibble: 384 × 5
#> crop country year harv_area prod
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Wheat United Republic of Tanzania 2024 86500 119000
#> 2 Sorghum United Republic of Tanzania 2024 750000 800000
#> 3 Maize (corn) United Republic of Tanzania 2024 5741681 10084000
#> 4 Wheat Zambia 2024 28052 198886.
#> 5 Sorghum Zambia 2024 4524 2865.
#> 6 Maize (corn) Zambia 2024 684402 1511143.
#> 7 Wheat United Republic of Tanzania 2023 60000 86522
#> 8 Sorghum United Republic of Tanzania 2023 683967 737819
#> 9 Maize (corn) United Republic of Tanzania 2023 4200000 8010949
#> 10 Wheat Zambia 2023 36551 277492.
#> # ℹ 374 more rows
#
# #2
crops_df %>% arrange(-year, country, -crop)
#> Error in `arrange()`:
#> ℹ In argument: `..3 = -crop`.
#> Caused by error in `-crop`:
#> ! invalid argument to unary operator
# crops_df %>% arrange(-year, country, desc(crop))
In #1 we sort first by year, in descending order
(desc() does the same as -, i.e. it
arranges in descending order), then by country alphabetically, and
then by crop in reverse alphabetical order. Note that in #2 we
can’t use - to do the reverse sort on crop,
even though the first - on year is valid: year
is numeric, but crop is character, and the unary minus is
not defined for character data. The commented-out variant would work if run.
2.2.8 Mutating
The last thing we will look at in this section is mutation,
which uses the mutate function. That is, we use it to
change a data.frame or tibble by either altering existing
variables or adding new ones. This brings us back to our
crops_df. In Chunk 16 we created a new yields
tibble so that we could demonstrate a merge with our
crops_df. That was an entirely unnecessary operation,
because if I wanted to create a separate yield column for
crops_df, I could have just done this:
# Chunk 19
crop_ylds <- crops_df %>% mutate(yield = prod / harv_area)
crop_ylds
#> # A tibble: 384 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania 1961 790000 590000 0.747
#> 2 Maize (corn) United Republic of Tanzania 1962 800000 600000 0.75
#> 3 Maize (corn) United Republic of Tanzania 1963 960000 850000 0.885
#> 4 Maize (corn) United Republic of Tanzania 1964 930000 720000 0.774
#> 5 Maize (corn) United Republic of Tanzania 1965 950000 751000 0.791
#> 6 Maize (corn) United Republic of Tanzania 1966 1100000 880000 0.8
#> 7 Maize (corn) United Republic of Tanzania 1967 1000000 750000 0.75
#> 8 Maize (corn) United Republic of Tanzania 1968 1014000 551000 0.543
#> 9 Maize (corn) United Republic of Tanzania 1969 1014000 638000 0.629
#> 10 Maize (corn) United Republic of Tanzania 1970 1015000 488000 0.481
#> # ℹ 374 more rows
This avoids the joining process entirely, doing in one step
what we did in two steps in Chunk 16. That is courtesy of the
mutate function.
We can also use mutate to help us modify existing rows.
Let’s do that. I don’t like having the full country name in the
tibble, particularly with spaces in the name. Let’s shorten
those.
# Chunk 20
set.seed(1)
crop_ylds %>%
mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country),
country = ifelse(country == "Zambia", "ZMB", country)) %>%
sample_n(5)
#> # A tibble: 5 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Wheat ZMB 1964 145 214 1.48
#> 2 Sorghum TZA 1999 659868 561031 0.850
#> 3 Sorghum TZA 1961 200000 180000 0.9
#> 4 Wheat TZA 2003 26890 74000 2.75
#> 5 Wheat TZA 1974 46000 82000 1.78
set.seed(1)
crop_ylds %>%
mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country)) %>%
mutate(country = ifelse(country == "Zambia", "ZMB", country)) %>%
sample_n(5)
#> # A tibble: 5 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Wheat ZMB 1964 145 214 1.48
#> 2 Sorghum TZA 1999 659868 561031 0.850
#> 3 Sorghum TZA 1961 200000 180000 0.9
#> 4 Wheat TZA 2003 26890 74000 2.75
#> 5 Wheat TZA 1974 46000 82000 1.78
In #1 we use the mutate function with
ifelse to change all values that match “United Republic of
Tanzania” to “TZA”, and all values matching
“Zambia” to “ZMB”, leaving everything else unchanged. In this case, we wrap both
arguments in a single mutate call, separated by a comma. We
could also use separate mutate calls for each change, as we
do in #2. In both cases we sample 5 rows to show that the changes were
made for both countries.
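As the number of recodings grows, chained ifelse calls get unwieldy. dplyr’s case_when offers a flatter alternative; here is a sketch on hypothetical values (not one of the module’s chunks):

```r
library(dplyr)

recoded <- tibble(country = c("United Republic of Tanzania", "Zambia", "Malawi")) %>%
  mutate(country = case_when(
    country == "United Republic of Tanzania" ~ "TZA",
    country == "Zambia" ~ "ZMB",
    TRUE ~ country  # leave any other value unchanged
  ))
recoded
```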
We’ll wrap it all up now:
# Chunk 21
crop_ylds <- crop_ylds %>%
mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country)) %>%
mutate(country = ifelse(country == "Zambia", "ZMB", country)) %>%
mutate(crop = tolower(crop)) %>%
mutate(crop = ifelse(crop == "maize (corn)", "maize", crop))
set.seed(1)
crop_ylds %>% sample_n(5)
#> # A tibble: 5 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 wheat ZMB 1964 145 214 1.48
#> 2 sorghum TZA 1999 659868 561031 0.850
#> 3 sorghum TZA 1961 200000 180000 0.9
#> 4 wheat TZA 2003 26890 74000 2.75
#> 5 wheat TZA 1974 46000 82000 1.78
We change the values in country, as before, then convert the crop names to lowercase and clean up the maize name, all in a single chained statement.
There are many other ways to mutate data, using more
advanced pattern matching capabilities and other features. We will start
to see more of those in subsequent sections.
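As a small preview, one such capability is pattern-based replacement inside mutate, for example with base R’s gsub (the stringr package offers similar tools). A sketch, assuming we wanted to clean the raw crop labels in a single pass:

```r
library(dplyr)

raw <- tibble(crop = c("Maize (corn)", "Wheat", "Sorghum"))

cleaned <- raw %>%
  mutate(crop = tolower(gsub(" \\(corn\\)", "", crop)))  # drop " (corn)", then lowercase
cleaned
```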
2.3 Practice
2.3.1 Questions
1. What is the difference between a tibble and a data.frame?
2. Given a tibble tb_a, with columns a and b, how would you extract column a using base R methods? Using tidyverse methods?
3. You are given a csv that is organized into 6 rows and 14 columns. The first two columns contain country names (two countries) and district names (three districts per country). The other 12 columns are named by the month of the year, and each month contains the average number of measles cases reported over a 5-year period in that month for each district in each country. Are these data tidy or messy? If messy, how would you rearrange them?
4. Describe why each of the four outputs from Chunk 15a differs as a function of the *_join function used.
2.3.2 Code
1. Re-create the dummy tibble in Chunk 3, and write it out to a csv using readr::write_csv, to any directory convenient for you (e.g. into your notebooks/ folder), calling it “my_dummy_file.csv”.
2. Use dir to list the files matching the word “dummy” in the directory you wrote to, and then read the file back in using readr::read_csv.
3. Building on the example in Chunk 15a, join all 4 tibbles together, but replace full_join with left_join. Do the same again using right_join.
4. Rearrange the results of the two joins you just did, using v5 as the sort key. Do an ascending sort on the result of the chained left_join, and a descending sort on the right_join.
5. Recreate the crop_ylds tibble. Skip the part where we created yield via a merge, and instead use the approach where we just did a mutate. Don’t forget to change the crop capitalization and the country names as well. After you reproduce what it looks like in Chunk 21, use mutate to add a new column harv_km2 by doing harv_area / 100.
6. Now rename the new harv_km2 variable to harv_area_km2, using the rename function.
7. Do all of this in a single pipeline: 1) create a tibble with variables v1 (values 1:10) and v2 (values 11:20); 2) rbind that with another tibble with the same variables but with values 11:20 and 21:30 (hint: you will need to wrap the second tibble in an rbind, e.g. rbind(., tibble())); 3) add a new column v3 to the result, which is the square of v2; 4) arrange it in descending order of v3; 5) capture the result of all that into my_tb.
8. End by selecting rows 1, 10, and 17 and columns v2 and v3 from my_tb, using slice and select.
3 Analyzing your data
Now that we have gotten an introduction to data preparation, we will learn how to analyze data. We have already seen some very basic analyses, in the form of basic algebra and data summaries. Now we will move on to more advanced data manipulation and analyses.
3.1 Split-apply-combine
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
So says Hadley Wickham in a paper that
describes this common data analysis strategy and the plyr
package, which is the forerunner of dplyr.
Long or tall (i.e. tidy) data formats facilitate split-apply-combine
(SAC) analyses. Let’s come back to our crop_ylds dataset to
look at this:
# Chunk 22
crop_ylds %>%
group_by(crop) %>%
summarize(mean_yield = mean(yield))
#> # A tibble: 3 × 2
#> crop mean_yield
#> <chr> <dbl>
#> 1 maize 1.54
#> 2 sorghum 0.780
#> 3 wheat 3.04
We just did an SAC on the crop_ylds data that entailed
splitting the dataset by crop type, applying the mean
function, and then combining the results of that into a summary
tibble. We can use an lapply and
rbind to illustrate the inner workings of that SAC:
# Chunk 23
# #1
splt_apply <- lapply(unique(crop_ylds$crop), function(x) {
dat <- crop_ylds[crop_ylds$crop == x, ] # here's the split
cat("\n") # add empty line
print(paste("Split out", x)) # print statement to mark split
print(dat[1, ]) # print showing first line of split tibble
o <- data.frame(crop = x, mean_yield = mean(dat$yield)) # here the apply
})
#>
#> [1] "Split out maize"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 maize TZA 1961 790000 590000 0.747
#>
#> [1] "Split out sorghum"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 sorghum TZA 1961 200000 180000 0.9
#>
#> [1] "Split out wheat"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 wheat TZA 1961 8000 6100 0.762
#
# #2
do.call(rbind, splt_apply) # here's the combine
#> crop mean_yield
#> 1 maize 1.538175
#> 2 sorghum 0.779559
#> 3 wheat 3.035511
Chunk 23 #1 contains both the split (using base R syntax
rather than dplyr, to avoid mixing syntaxes) and the apply
part (see the comments in the code), with some extra print statements
added so that we can see how the splits happen within the
lapply (by showing the first row of each split-out
tibble). Note how unique is used
here: it returns a vector of unique crop names, which is passed to an
anonymous function that iterates over each crop name, using
logical indexing to subset (split) out the rows of
crop_ylds that correspond to that crop. #2 does
the combine, producing the same result that dplyr gave us
in Chunk 22. Given that, you will note that dplyr::group_by does
the splitting work, dplyr::summarize does the applying
(of mean), and the combine is implicit.
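Base R also has a split function that makes the “split” step a single call; this sketch reproduces the same kind of SAC on a toy data.frame (made-up values standing in for crop_ylds):

```r
# Toy data standing in for crop_ylds
df <- data.frame(crop = c("a", "a", "b", "b"),
                 yield = c(1, 3, 2, 4))

pieces <- split(df$yield, df$crop)           # split: a list, one vector per crop
means  <- vapply(pieces, mean, numeric(1))   # apply: mean of each piece
combined <- data.frame(crop = names(means),  # combine: back into one data.frame
                       mean_yield = unname(means))
combined
```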
Now let’s look at a more complex SAC:
# Chunk 24
crop_ylds %>%
group_by(crop, country) %>%
summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: crop [3]
#> crop country mean_yield
#> <chr> <chr> <dbl>
#> 1 maize TZA 1.32
#> 2 maize ZMB 1.75
#> 3 sorghum TZA 0.904
#> 4 sorghum ZMB 0.655
#> 5 wheat TZA 1.38
#> 6 wheat ZMB 4.69
Here we do two splits: the first on crop, the second on country, so
after applying and combining we get the mean yield for each crop for
each country. Let’s do that with lapply:
# Chunk 25a
# #1
splt_apply <- lapply(unique(crop_ylds$crop), function(x) {
cat("\n") # add empty line
print(paste("Split out", x)) # print to mark outer split
cntry_splt_apply <- lapply(unique(crop_ylds$country), function(y) {
dat <- crop_ylds[crop_ylds$crop == x & crop_ylds$country == y, ] # split
cat("\n") # add empty line
print(paste("...then split out", y, x)) # print to mark inner split
print(dat[1, ]) # print to show 1st line of split tibble
o <- data.frame(crop = x, country = y, mean_yield = mean(dat$yield))
})
do.call(rbind, cntry_splt_apply)
})
#>
#> [1] "Split out maize"
#>
#> [1] "...then split out TZA maize"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 maize TZA 1961 790000 590000 0.747
#>
#> [1] "...then split out ZMB maize"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 maize ZMB 1961 750000 660000 0.88
#>
#> [1] "Split out sorghum"
#>
#> [1] "...then split out TZA sorghum"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 sorghum TZA 1961 200000 180000 0.9
#>
#> [1] "...then split out ZMB sorghum"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 sorghum ZMB 1961 67000 40000 0.597
#>
#> [1] "Split out wheat"
#>
#> [1] "...then split out TZA wheat"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 wheat TZA 1961 8000 6100 0.762
#>
#> [1] "...then split out ZMB wheat"
#> # A tibble: 1 × 6
#> crop country year harv_area prod yield
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 wheat ZMB 1961 380 608 1.6
#
# #2
do.call(rbind, splt_apply)
#> crop country mean_yield
#> 1 maize TZA 1.3240753
#> 2 maize ZMB 1.7522742
#> 3 sorghum TZA 0.9038347
#> 4 sorghum ZMB 0.6552833
#> 5 wheat TZA 1.3769024
#> 6 wheat ZMB 4.6941196
Making this SAC with lapply requires an
lapply nested in an lapply. The outer
lapply splits on crop, and the inner on each country within
each crop (using unique(crop_ylds$country) to get the
vector of country names). mean is applied to the split data
within the inner lapply.
Right, so that might look daunting, but not to worry–we are just
using those lapplys to illustrate the splitting, and also
to show how using dplyr pipelines can make things much
simpler. So we’ll forge ahead with those.
What happens if we want to do SAC on an even finer range of data within each split, say, calculate mean yield since the year 2000 by crop and by country:
# Chunk 25b
crop_ylds %>% filter(year >= 2000) %>% group_by(crop, country) %>%
summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: crop [3]
#> crop country mean_yield
#> <chr> <chr> <dbl>
#> 1 maize TZA 1.60
#> 2 maize ZMB 2.29
#> 3 sorghum TZA 1.01
#> 4 sorghum ZMB 0.707
#> 5 wheat TZA 1.39
#> 6 wheat ZMB 6.67
We can add dplyr::filter with the conditional statement
(year >= 2000) in it, to select observations since the
year 2000, and do the SAC on those. According to
?dplyr::filter’s function description, we:
Use filter() to choose rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
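That NA-dropping behavior is worth seeing once, since it differs from base subsetting. A minimal sketch on made-up values:

```r
library(dplyr)

df <- data.frame(year = c(1999, NA, 2005))

kept_dplyr <- filter(df, year >= 2000)             # 1 row: the NA case is dropped
kept_base  <- df[df$year >= 2000, , drop = FALSE]  # 2 rows: the NA case kept (as NA)
nrow(kept_dplyr)
nrow(kept_base)
```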
How about pre-2000 mean yields?
# Chunk 26
crop_ylds %>% filter(year < 2000) %>% group_by(crop, country) %>%
summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: crop [3]
#> crop country mean_yield
#> <chr> <chr> <dbl>
#> 1 maize TZA 1.15
#> 2 maize ZMB 1.41
#> 3 sorghum TZA 0.839
#> 4 sorghum ZMB 0.622
#> 5 wheat TZA 1.37
#> 6 wheat ZMB 3.43
Same procedure, just a different conditional operator inside
filter. But maybe we want to compare pre- and post-2000
yields in the same dataset:
# Chunk 27
crop_ylds %>%
mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>%
group_by(crop, country, y2k) %>%
summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 12 × 4
#> # Groups: crop, country [6]
#> crop country y2k mean_yield
#> <chr> <chr> <chr> <dbl>
#> 1 maize TZA 1961_2000 1.15
#> 2 maize TZA 2000_2024 1.60
#> 3 maize ZMB 1961_2000 1.41
#> 4 maize ZMB 2000_2024 2.29
#> 5 sorghum TZA 1961_2000 0.839
#> 6 sorghum TZA 2000_2024 1.01
#> 7 sorghum ZMB 1961_2000 0.622
#> 8 sorghum ZMB 2000_2024 0.707
#> 9 wheat TZA 1961_2000 1.37
#> 10 wheat TZA 2000_2024 1.39
#> 11 wheat ZMB 1961_2000 3.43
#> 12 wheat ZMB 2000_2024 6.67
That’s a little more complex, and doesn’t use filter.
Instead, we first use mutate with ifelse to
create a new variable (y2k), which indicates whether years are
before 2000, or 2000 and onward. We then use y2k to apply a
third split to the data after our crop and country splits, and finally
end with the mean yields from our three-way splits.
That comparison might be easier if we put the pre- and post-2000 yields next to one another. That would require just one more step to do:
# Chunk 28
crop_ylds %>%
mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>%
group_by(crop, country, y2k) %>%
summarize(mean_yield = mean(yield)) %>%
pivot_wider(names_from = y2k, values_from = mean_yield)
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 4
#> # Groups: crop, country [6]
#> crop country `1961_2000` `2000_2024`
#> <chr> <chr> <dbl> <dbl>
#> 1 maize TZA 1.15 1.60
#> 2 maize ZMB 1.41 2.29
#> 3 sorghum TZA 0.839 1.01
#> 4 sorghum ZMB 0.622 0.707
#> 5 wheat TZA 1.37 1.39
#> 6 wheat ZMB 3.43 6.67
In this case we just use pivot_wider with y2k
as the key, and the mean yields as the values. Much easier to compare, in
my opinion, and it shows how much larger post-2000 yields are. Wait, we
could calculate exactly how much larger:
# Chunk 29
crop_ylds %>%
mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>%
group_by(crop, country, y2k) %>%
summarize(mean_yield = mean(yield)) %>%
pivot_wider(names_from = y2k, values_from = mean_yield) %>%
mutate(yield_ratio = `2000_2024` / `1961_2000`)
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 5
#> # Groups: crop, country [6]
#> crop country `1961_2000` `2000_2024` yield_ratio
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 maize TZA 1.15 1.60 1.39
#> 2 maize ZMB 1.41 2.29 1.63
#> 3 sorghum TZA 0.839 1.01 1.20
#> 4 sorghum ZMB 0.622 0.707 1.14
#> 5 wheat TZA 1.37 1.39 1.01
#> 6 wheat ZMB 3.43 6.67 1.95
We use another mutate to create another variable that is
the ratio of post-2000 and pre-2000 yields. Maybe we are interested in a
straight comparison of pre- and post-2000 maize yield differences across
all countries:
# Chunk 30
crop_ylds %>% filter(crop == "maize") %>%
mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>%
group_by(y2k) %>% summarize(mean_yield = mean(yield)) %>%
pivot_wider(names_from = y2k, values_from = mean_yield) %>%
mutate(yield_ratio = `2000_2024` / `1961_2000`)
#> # A tibble: 1 × 3
#> `1961_2000` `2000_2024` yield_ratio
#> <dbl> <dbl> <dbl>
#> 1 1.28 1.94 1.52
We leave it to you in the practice questions below to answer how Chunk 30 differs from Chunk 29.
3.2 Data summaries
Now that we have a sense of how to do split-apply-combine, let’s dig
a bit more into the analyses that we can do as part of that process.
Since R was originally designed for statistics, there are
many possible analyses we can do with R. We’ll look at just
a few common ones, starting with the calculation of summary statistics,
outside of SAC. The easiest way to do that in R is to use
the base summary function.
# Chunk 31
# #1
summary(crop_ylds$harv_area)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 41 33560 155600 604725 802500 5741681
#
# #2
crop_ylds %>%
select(harv_area) %>%
summary()
#> harv_area
#> Min. : 41
#> 1st Qu.: 33560
#> Median : 155600
#> Mean : 604725
#> 3rd Qu.: 802500
#> Max. :5741681
#
# #3
crop_ylds %>%
pull(harv_area) %>%
summary()
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 41 33560 155600 604725 802500 5741681
In #1, we use the summary function on
harv_area, pulled out of crop_ylds as a vector. #2
does the same thing using dplyr verbs. Note how
summary() appears at the end of the pipeline, and we don’t
have to give it a variable name because it implicitly consumes
everything coming out of the upstream pipe (in this case, just
harv_area). However, that returns summary output in a
table, whereas the base method returns a named vector. #3
produces the equivalent of #1, using dplyr::pull instead of
select. pull is the dplyr equivalent of
crop_ylds$harv_area, i.e. it pulls a variable from a
data.frame or tibble into a vector.
summary gives us a set of quantile values and the
mean of the vector. We can also use it on more than one variable:
# Chunk 32
# #1
summary(crop_ylds[, c("harv_area", "prod")])
#> harv_area prod
#> Min. : 41 Min. : 81
#> 1st Qu.: 33560 1st Qu.: 44999
#> Median : 155600 Median : 185810
#> Mean : 604725 Mean : 862514
#> 3rd Qu.: 802500 3rd Qu.: 880799
#> Max. :5741681 Max. :10084000
#
# #2
crop_ylds %>%
select(harv_area, prod) %>%
summary()
#> harv_area prod
#> Min. : 41 Min. : 81
#> 1st Qu.: 33560 1st Qu.: 44999
#> Median : 155600 Median : 185810
#> Mean : 604725 Mean : 862514
#> 3rd Qu.: 802500 3rd Qu.: 880799
#> Max. :5741681 Max. :10084000
#1 shows the base approach, and #2 the dplyr
method. Fairly straightforward. But how about doing it within an SAC
(the approaches above are not), where we need to apply some grouping?
# Chunk 33
q1 <- function(x) quantile(x, 0.25)
q3 <- function(x) quantile(x, 0.75)
crop_ylds %>% select(crop, country, harv_area, prod) %>%
group_by(crop, country) %>%
rename(ha = harv_area, p = prod) %>%
summarise_all(list(min, q1, med = median, mu = mean, q3, max))
#> # A tibble: 6 × 14
#> # Groups: crop [3]
#> crop country ha_fn1 p_fn1 ha_fn2 p_fn2 ha_med p_med ha_mu p_mu ha_fn3 p_fn3 ha_fn4 p_fn4
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 maize TZA 790000 4.88e5 1.10e6 1.46e6 1.61e6 2.34e6 2.14e6 3.01e6 3.20e6 4.47e6 5.74e6 1.01e7
#> 2 maize ZMB 362787 4.83e5 5.95e5 8.17e5 7.67e5 1.13e6 8.09e5 1.46e6 1.00e6 1.86e6 1.43e6 3.62e6
#> 3 sorgh… TZA 162000 1.35e5 4.29e5 2.88e5 6.51e5 5.84e5 5.72e5 5.31e5 7.13e5 7.30e5 8.74e5 9.71e5
#> 4 sorgh… ZMB 4524 2.86e3 2.18e4 1.50e4 3.80e4 2.25e4 4.20e4 2.65e4 6.5 e4 3.58e4 1 e5 6.11e4
#> 5 wheat TZA 8000 6.1 e3 4.15e4 5.88e4 5.20e4 7.32e4 5.56e4 7.23e4 6 e4 8.54e4 1.49e5 1.67e5
#> 6 wheat ZMB 41 8.1 e1 1.47e3 4.98e3 1.05e4 5.63e4 1.26e4 7.85e4 2.18e4 1.35e5 4.18e4 2.77e5
The above is somewhat more complex. We want to calculate
summary statistics for prod and harv_area by
crop and country, but dplyr can’t easily
produce summary’s tabular output for each group. Therefore, we
calculate each statistic separately, using
summarise_all() (which applies the summarizing functions to
each selected variable), passing the functions in a list
to summarise_all(). In the first two lines, we define
two custom functions: the first calculates the first quartile (25th
percentile), the second the third quartile (75th percentile).
In the pipeline, we also do some renaming gymnastics, changing
harv_area to ha and prod to p, and
giving shorter names (med, mu) to the outputs
of median and mean. These renamings are in
service of narrowing the width of the output columns.
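Note that in current dplyr, summarise_all() is superseded by across(), which also gives full control over the output column names (so the renaming gymnastics become optional). A sketch on a toy tibble with made-up numbers:

```r
library(dplyr)

toy <- tibble(crop = rep(c("maize", "wheat"), each = 2),
              ha   = c(10, 30, 5, 15),
              p    = c(20, 40, 10, 20))

stats_tbl <- toy %>%
  group_by(crop) %>%
  summarise(across(c(ha, p),
                   list(min = min, med = median, mu = mean, max = max),
                   .names = "{.col}_{.fn}"))
stats_tbl
```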
3.3 Associations
Evaluating associations between data variables (e.g. correlations) is
one of the most frequent uses for R. Lets start with
correlations:
# Chunk 34
# #1
cor(crop_ylds[, c("prod", "harv_area")])
#> prod harv_area
#> prod 1.0000000 0.9513794
#> harv_area 0.9513794 1.0000000
#
# #2
crop_ylds %>%
select(prod, harv_area) %>%
cor()
#> prod harv_area
#> prod 1.0000000 0.9513794
#> harv_area 0.9513794 1.0000000
#
# #3
cor.test(x = crop_ylds$prod, y = crop_ylds$harv_area)
#>
#> Pearson's product-moment correlation
#>
#> data: crop_ylds$prod and crop_ylds$harv_area
#> t = 60.368, df = 382, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.9408854 0.9600489
#> sample estimates:
#> cor
#> 0.9513794
#
# # 4
crop_ylds %>%
group_by(crop, country) %>%
summarise(cor = cor(prod, harv_area))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: crop [3]
#> crop country cor
#> <chr> <chr> <dbl>
#> 1 maize TZA 0.942
#> 2 maize ZMB 0.771
#> 3 sorghum TZA 0.858
#> 4 sorghum ZMB 0.942
#> 5 wheat TZA 0.631
#> 6 wheat ZMB 0.987
We first apply stats::cor using base syntax to calculate
the correlation matrix between prod and harv_area
(#1), and then do the same thing with dplyr. In #3 we apply
the cor.test to those two variables, using base syntax,
which produces statistics regarding the confidence interval and
significance of the correlation coefficient. This is harder to do in
dplyr, because the output from cor.test is not
tabular, so we are not doing that here (but there is a tidyverse
oriented corrr package that should help with that). #4
shows how to use cor with summarize when
grouping by crop and country. The output is the
correlation coefficient for the two variables for each crop and country
over the time series (this only works for a two variable correlation
analysis–it fails when a third variable is added). Note that the
correlations are all positive (as they should
be–generally the more area you plant to a crop, the more tons of it you
produce), but they vary in strength: maize in Zambia and wheat in
Tanzania are noticeably weaker than the others. We’ll come back to that.
The next step after a correlation is a regression, which quantifies how much a variable Y changes in response to changes in a variable X (or several different X variables). Note: This is not a statistics course, so if you have never seen a regression before, then some of what comes next might be a little opaque. I recommend that you read up on the basics of least squares regression to help understand these examples better. Here is one of many online resources.
Here’s our first regression example:
# Chunk 35
# #1
prodha_lm <- lm(prod ~ harv_area, data = crop_ylds)
prodha_lm
#>
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds)
#>
#> Coefficients:
#> (Intercept) harv_area
#> -52871.524 1.514
#
# #2
summary(prodha_lm)
#>
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2574479 -222099 33457 84947 1860415
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.287e+04 2.747e+04 -1.925 0.055 .
#> harv_area 1.514e+00 2.508e-02 60.368 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 448800 on 382 degrees of freedom
#> Multiple R-squared: 0.9051, Adjusted R-squared: 0.9049
#> F-statistic: 3644 on 1 and 382 DF, p-value: < 2.2e-16
In #1 above, we use R’s lm (linear model)
function to fit a regression between prod (dependent variable)
and harv_area (independent variable) across the entire
crop_ylds dataset. Note the ~, which is used
to specify the left-hand and right-hand sides of the regression formula.
The output of lm is captured in prodha_lm, which
when printed shows just the model formula and the regression coefficients. To see
more of the model’s results, you have to use summary on the
prodha_lm, as we do in #2, which spits out quite a bit,
including the model fit (significance of coefficients, R2,
etc.).
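To see the distinction between the printed lm object and its summary() in a self-contained way, here is a sketch on synthetic data with a known slope of 1.5 (all numbers made up, not the course dataset):

```r
set.seed(1)
# Synthetic "harvested area" and "production" with a true slope of 1.5
harv_area <- runif(100, min = 1e5, max = 1e6)
prod <- 1.5 * harv_area + rnorm(100, sd = 5e4)

fit <- lm(prod ~ harv_area)
coef(fit)                     # printing fit shows little more than this
r2 <- summary(fit)$r.squared  # summary() exposes the fit statistics too
r2
```

Because the noise is small relative to the signal, the estimated slope lands close to the true 1.5 and the R-squared is high.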
Given our dataset, this relationship is not that interesting. At the very least, we would want to see the relationship by crop type, and also by crop and year. That requires some splitting:
# Chunk 36
# #1
maize_lm <- lm(prod ~ harv_area,
data = crop_ylds %>% filter(crop == "maize"))
#
# #2
maize_zm_lm <- lm(prod ~ harv_area,
data = crop_ylds %>%
filter(crop == "maize" & country == "ZMB"))
summary(maize_zm_lm)
#>
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds %>% filter(crop ==
#> "maize" & country == "ZMB"))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1127080 -539296 159063 448827 1015488
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.497e+05 2.321e+05 -2.800 0.00681 **
#> harv_area 2.608e+00 2.732e-01 9.543 8.81e-14 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 561900 on 62 degrees of freedom
#> Multiple R-squared: 0.595, Adjusted R-squared: 0.5884
#> F-statistic: 91.07 on 1 and 62 DF, p-value: 8.81e-14
#
# #3
maize_tza_lm <- lm(prod ~ harv_area,
data = crop_ylds %>%
filter(crop == "maize" & country == "TZA"))
summary(maize_tza_lm)
#>
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds %>% filter(crop ==
#> "maize" & country == "TZA"))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2608668 -357294 -104795 250739 2101905
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.659e+05 1.861e+05 -3.041 0.00345 **
#> harv_area 1.672e+00 7.530e-02 22.200 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 747600 on 62 degrees of freedom
#> Multiple R-squared: 0.8883, Adjusted R-squared: 0.8865
#> F-statistic: 492.9 on 1 and 62 DF, p-value: < 2.2e-16
The first split is simply on crop type (#1), so the regression is
maize production on planted area across countries and years. Notice
the use of dplyr pipes and verbs to split out the maize
observations within the “data” argument of lm. The
coefficients show us that production increases by roughly
1.51 tons per 1 ha increase in planted area. That’s
interesting, because it sounds pretty close to what you would get if
you calculated average yield by dividing each production estimate by
harvested area and then taking the average. We leave it to you to
calculate mean yield in the practice below.
In #2, we split out Zambia, and in this case we see that Zambian maize production increases by 2.61 tons per ha of planted area, while #3 shows that Tanzanian maize production increases by a smaller 1.67 tons per ha. Remember the differences in correlation strength in Chunk 34 #4 above? In the next section we will take a graphical look at the reasons for these differences.
We could also fit the lm and summary at the
end of the dplyr pipeline:
# Chunk 37
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>%
lm(prod ~ harv_area, data = .) %>% summary()
#>
#> Call:
#> lm(formula = prod ~ harv_area, data = .)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2608668 -357294 -104795 250739 2101905
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.659e+05 1.861e+05 -3.041 0.00345 **
#> harv_area 1.672e+00 7.530e-02 22.200 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 747600 on 62 degrees of freedom
#> Multiple R-squared: 0.8883, Adjusted R-squared: 0.8865
#> F-statistic: 492.9 on 1 and 62 DF, p-value: < 2.2e-16
That is actually more elegant than the code in Chunk 36. To make
this work, we have to use the . in the “data” argument
within lm, which represents the
data.frame/tibble coming in from the upstream
part of the pipeline. Note that we don’t need the . in the
summary function. Try running the code above, removing
the . from within lm, to see what
happens.
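As an aside (assuming your setup has R >= 4.2), base R now has its own pipe, |>, where the placeholder is _ rather than . and must be passed to a named argument. A sketch of the same pattern using the built-in mtcars dataset:

```r
# Native-pipe equivalent of the magrittr pattern above: the _
# placeholder stands in for the piped data in the "data" argument
fit_summary <- mtcars |>
  subset(cyl == 4) |>
  lm(mpg ~ wt, data = _) |>
  summary()
fit_summary$coefficients
```

With %>%, the . can appear anywhere; with |>, the _ must be bound to a named argument, which is why data = _ is written explicitly here.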
I’ll end this very brief introduction to analysis with a look at how
we could do split-apply-combine (SAC) with regression, so that we can
run all country X crop
prod ~ harv_area regressions. We’ll start with a melange of
base R and dplyr:
# Chunk 38
crop_country_lms <- lapply(unique(crop_ylds$crop), function(x) {
lms <- lapply(unique(crop_ylds$country), function(y) {
dat <- crop_ylds %>% filter(crop == x & country == y)
summary(lm(prod ~ harv_area, dat))$coefficients
})
names(lms) <- unique(crop_ylds$country)
return(lms)
})
names(crop_country_lms) <- unique(crop_ylds$crop)
crop_country_lms
#> $maize
#> $maize$TZA
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.659083e+05 1.860748e+05 -3.041294 3.449820e-03
#> harv_area 1.671763e+00 7.530326e-02 22.200408 3.337146e-31
#>
#> $maize$ZMB
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.497481e+05 2.320597e+05 -2.799917 6.806310e-03
#> harv_area 2.607587e+00 2.732472e-01 9.542959 8.809769e-14
#>
#>
#> $sorghum
#> $sorghum$TZA
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -79311.770843 4.896830e+04 -1.619656 1.103823e-01
#> harv_area 1.066315 8.097334e-02 13.168723 1.274210e-19
#>
#> $sorghum$ZMB
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2231.4367094 1.237337e+03 1.803419 7.618158e-02
#> harv_area 0.5773724 2.608042e-02 22.138158 3.896236e-31
#>
#>
#> $wheat
#> $wheat$TZA
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.420561e+04 6510.7806906 5.253688 1.940746e-06
#> harv_area 6.856259e-01 0.1070689 6.403593 2.289141e-08
#>
#> $wheat$ZMB
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7323.512886 2428.4367561 -3.015731 3.713149e-03
#> harv_area 6.831819 0.1394885 48.977638 2.818943e-51
This arrangement is similar to what we were doing in Chunk 25b, using
nested lapply to split on crop and country (using
dplyr::filter instead of base R syntax to do
the split), but applying lm inside the bodies of the
lapply. We extract only the coefficient table from
summary of each lm, to condense the
output.
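One practical follow-up is condensing that nested list into a single table of slopes. The sketch below builds its own tiny nested list (with made-up data) so it is self-contained, but the same do.call(rbind, lapply(...)) pattern would work on crop_country_lms:

```r
set.seed(1)
# Build a small nested list shaped like crop_country_lms (fake data)
fit_tab <- function(x, y) summary(lm(y ~ x))$coefficients
nested <- list(
  maize = list(TZA = fit_tab(1:10, 2 * (1:10) + rnorm(10)),
               ZMB = fit_tab(1:10, 3 * (1:10) + rnorm(10)))
)

# Flatten: pull the slope estimate out of each coefficient table
slopes <- do.call(rbind, lapply(names(nested), function(cr) {
  do.call(rbind, lapply(names(nested[[cr]]), function(co) {
    data.frame(crop = cr, country = co,
               slope = nested[[cr]][[co]]["x", "Estimate"])
  }))
}))
slopes
```

do.call(rbind, ...) is the base-R "combine" step, stacking the per-group data.frames into one table.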
There is a pure tidyverse way of doing this, which is the last thing
to look at here. It is more involved than previous examples, because the
non-tabular (list) output from lm prevents use of
lm within dplyr::summarise:
# Chunk 39
# note: replacement for old variant suggested by answer here:
# https://stackoverflow.com/questions/62972843/retreiving-tidy-
# results-from-regression-by-group-with-broom
crop_ylds %>%
nest_by(crop, country) %>%
mutate(prod_ha_lm = list(lm(prod ~ harv_area, data = data))) %>%
reframe(broom::tidy(prod_ha_lm))
This produces a nice output table that lists the coefficients
(estimate), standard errors, t-statistic, and p-value for each
crop-country model. Each model occupies two lines: one for the model
intercept results, one for the independent variable’s results. There are
a couple of functions here we have not seen before. nest_by()
groups the data by crop and country and collapses each group’s rows
into a list-column named data, so there is one row per crop-country
combination. mutate() then fits an lm for each of those rows (wrapped
in list(), because a fitted model is not a simple value), and
reframe() (a variant of summarise() that can return multiple rows per
group) expands the tidied model results back out into a regular table.
And then there is broom::tidy, a function from
broom, a non-core tidyverse package. The purpose of
this function is to take a model object and convert it to a
data.frame, which is what allows us to turn the messy
lm output into a nice table at the end of the pipeline
above.
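If broom is not available, the same table can be extracted with base R, since broom::tidy() applied to an lm fit is essentially the coefficient matrix from summary() reshaped into a data frame. A sketch using the built-in mtcars data:

```r
# Base-R analogue of broom::tidy() for an lm fit
fit <- lm(mpg ~ wt, data = mtcars)
tab <- as.data.frame(summary(fit)$coefficients)
tab$term <- rownames(tab)   # term names: "(Intercept)" and "wt"
rownames(tab) <- NULL
tab
```

The column names differ slightly (broom uses estimate, std.error, statistic, p.value), but the content is the same.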
Okay, so that’s where we will leave it. Now for some practice.
3.4 Practice
3.4.1 Questions
1. What function can you use to select a range of values from a tibble?
2. Looking at Chunk 27, and thinking about split-apply-combine operations, identify which line is responsible for splitting the data, what split(s) is (are) being made, and which line is responsible for doing the applying, and what is it applying? Finally, which line is doing the combining?
3. In Chunk 29, what did we do differently compared to Chunk 28?
4. What is one limitation of dplyr::summarise with respect to analyzing groupings? What can we do to overcome that limitation?
3.4.2 Code
This coding practice assumes you have created the
crop_ylds dataset to match its form as of Chunk 21.

1. Use dplyr verbs to subset Tanzania sorghum yield from the year 2000 onwards. Do the same but use base R syntax.
2. Calculate the mean and sd of prod, harv_area, and yield in the subset you just made, using dplyr::summarise_all and funs. Chunk 33 is your guide.
3. Now calculate the mean and sd of harv_area and prod by crop and by country, using dplyr. Hint: you will need to use dplyr::group_by. Chunk 33 is also a guide.
4. Use dplyr to calculate a correlation table between yield and harv_area for Zambian maize. Use cor.test to evaluate this relationship also. Hint: you will need to use the base R approach for this from Chunk 34 #3, but to preserve the subset of Zambia yields, first assign the subset to a new object dat.
5. Use dplyr verbs to calculate the average maize yield (across, rather than within, countries). How close is that to the regression coefficient on harv_area from Chunk 36 #1?
6. Use lm to fit a regression between Zambia maize yield and year, with year as the independent variable. Do the same for Tanzania maize yields. Use dplyr verbs to create the subsets, but use both the base-centric (Chunk 36) and dplyr-centric (Chunk 37) ways of running the lms. Which country shows the largest annual change in yields?
7. Challenge: Adapt the code in Chunk 39 to create regressions between yield and year by country and crop, but just for maize and wheat.
4 Visualization
Until now we have been looking at numerical outputs from our
R code. Graphical representation often conveys more
information, and is typically done alongside numerical analysis. Our
last section on regressions leads us naturally to the subject of
visualization, as the most obvious way to assess a linear relationship
between two variables is to examine their scatterplots. So let’s look at
that to start, but first a quick introduction to two different plotting
approaches.
4.1 graphics versus grid graphics
Back in Module 3, section 1.1, we mentioned that there is a split in R plotting
methods centered on base R’s graphics and
grid packages. ggplot2 is arguably the most
prominent descendant of the latter branch of graphics.
We are going to learn examples of both approaches in this section. The reasons for this are that:
- Many spatial packages (including sf and the newer stars) have generic plot functions with syntax modeled on graphics::plot;
- The same spatial packages are also moving toward compatibility with ggplot2, which is useful to know because that is rapidly becoming a standard.
I would add that it is helpful to know both approaches, because
graphics based plots are often very good for rapid
interactive data exploration, whereas ggplot can be helpful
for creating more polished, presentation-grade graphics.
4.2 Two-dimensional plots
Back to our regression examples. We were exploring relationships between crop production and harvested area. We could get a better sense of that relationship by plotting those two variables against one another.
4.2.1 Scatterplots
We’ll start with graphics::plot, which
?plot tells us is a:
Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par.
There are many, many possible parameters to tune in making such a
plot (which you can find in ?par), but we will just look at a
few. Here’s a start:
This is a purely base way of plotting the relationship between
harv_area on the X axis and prod on the Y axis
for just maize (not distinguishing by country). We set up the
relationship between the two variables as a formula, with prod
on the left-hand side (LHS) and harv_area on the right-hand
side (RHS) of the equation, and the data, including the subsetting, in
the “data” argument slot.
We could do the same exact thing using the syntax below:
Both are more cumbersome, but note that in both cases we pass our data as separate vectors into the “x” and “y” arguments. We don’t show the plots above, but they are each identical to the one in Chunk 40, except for the axis labels.
That’s not a particularly appealing plot, so let’s dress it up a bit:
# Chunk 42a
plot(prod ~ harv_area, data = crop_ylds[crop_ylds$crop == "maize", ],
pch = 16, cex = 0.7, col = "blue",
main = "Maize production versus harvested area",
xlab = "Harvested area (ha)",
ylab = "Production (tons)")This adds on to the approach in Chunk 40. We use the “pch” argument to change the type of symbol, “cex” changes the size of the symbol, “col” the color of the symbol, and “main”, “xlab”, and “ylab” add titles for the top of the plot and the X and Y axis, respectively.
Looking at the plot, we can see there is a clear positive relationship in the data up to about 3 million ha planted area, after which the rate of increase seems to slacken. What’s going on there?
# Chunk 42b
# #1
zmb_maize <- crop_ylds %>% filter(crop == "maize" & country == "ZMB")
tza_maize <- crop_ylds %>% filter(crop == "maize" & country == "TZA")
# #2
plot(prod ~ harv_area, data = zmb_maize,
pch = 16, cex = 0.7, col = "blue",
main = "Zambia maize prod vs harv area",
xlab = "Harvested area (ha)",
ylab = "Production (tons)")# #3
plot(prod ~ harv_area, data = tza_maize,
pch = 16, cex = 0.7, col = "red",
main = "Tanzania maize prod vs harv area",
xlab = "Harvested area (ha)",
ylab = "Production (tons)")The difference is caused by different production/harvested area
patterns in the two countries in the dataset. In the examples above, we
separately plot each country’s maize dataset, using
dplyr::filter to create two new subsets of the data, which
we feed separately to two new plot chunks, changing the
title and point color in each. This makes the reason for the original
pattern (in Chunk 41) clearer, but is still not ideal. Better to put
both on one plot:
# Chunk 43
# #1
xl <- crop_ylds %>% filter(crop == "maize") %>% pull(harv_area) %>% range()
yl <- crop_ylds %>% filter(crop == "maize") %>% pull(prod) %>% range()
# #2
plot(prod ~ harv_area, data = zmb_maize,
pch = 16, cex = 0.7, col = "blue", xlim = xl, ylim = yl,
main = "Maize production vs harvested area",
xlab = "Harvested area (ha)",
ylab = "Production (tons)")
# #3
points(prod ~ harv_area, data = tza_maize, pch = 16, cex = 0.7, col = "red")
# #4
legend(x = "topleft", legend = c("Zambia", "Tanzania"),
col = c("blue", "red"), pch = 20)That’s much clearer now, and the plot now makes obvious some additional information: Tanzania’s harvested area and total maize production are larger than Zambia’s. Despite these differences in scale, both countries show a clear positive relationship between the two variables, meaning that production increases as harvested area expands rather than declining. Additionally, the steeper slope of Zambia’s cluster suggests it achieves higher yields per unit of land, whereas Tanzania exhibits greater variability in production for similar amounts of harvested area.
Putting those variables on one plot took a few additional steps.
First, we had to make both datasets fit on the same scale. That required
fixing the data ranges on the X and Y axis of the plot.
plot would otherwise automatically set the range to match the
first input dataset (Zambian maize). So (in #1) we first figured out the
X and Y ranges by using dplyr’s filter, and
pull to extract the harv_area and prod
vectors and separately calculate their ranges into new variables
(xl and yl). We then (in #2) passed these in
to plot to set the “xlim” and “ylim” values (controlling
the range of the X and Y axes, respectively). plot was
applied to the zmb_maize. We then used points
(in #3) to add the tza_maize data to the existing plot.
points does what its name says–it adds points to a plot.
An existing plot must already be established for points to
work. The arguments are the same as those used in plot.
Finally, in #4, we add a legend so that the reader can distinguish
between the two sets of points. The “x” argument says where the legend
should be placed in the graph, “legend” gives the legend text, “col”
the point colors, and “pch” for what the point types should be.
That was a fair bit of work. Could we save some effort with
dplyr piping?
# Chunk 44
crop_ylds %>% filter(crop == "maize" & country == "ZMB") %>%
plot(prod ~ harv_area, data = .,
pch = 16, cex = 0.7, col = "blue", xlim = xl, ylim = yl,
main = "Maize production vs harvested area",
xlab = "Harvested area (ha)",
ylab = "Production (tons)")
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>%
points(prod ~ harv_area, data = ., pch = 16, cex = 0.7, col = "red")
legend(x = "topleft", legend = c("Zambia", "Tanzania"),
col = c("blue", "red"), pch = 20)A little bit, it seems. We can subset the dataset using
filter, and then pipe each subset to plot and
points, respectively. To do this, we give . to
the “data” argument in each function in order to receive the piped
dataset.
That’s not an ideal pipeline example, however, because
dplyr pipes are really more designed to work with the
tidyverse’s plotting tools, i.e. ggplot2.
4.2.1.1 Introducing ggplot2
Now’s a good time to introduce ggplot2. Let’s recreate
the first graph as a start:
# Chunk 45
library(ggplot2)
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(mapping = aes(x = harv_area, y = prod), col = "blue") +
xlab("Harvested area (ha)") + ylab("Production (tons)") +
ggtitle("Maize production vs harvested area")That’s a basic recreation of the same plot, although
ggplot by default has a grey gridded background (which can
be removed if needed). You can see that the syntax is very
different than it is with plot. ggplot is
based on a “grammar of graphics”, on which there is a whole
book, but also a paper by
Hadley Wickham. There is a lot of information there, but some text from
ggplot2 tidyverse
page is instructive:
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details…It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
There are many functions in ggplot2 (the package) that
control plotting functionality, and we won’t go through them all here
(the cheatsheet
is a nice resource though). So we will go over just a few core parts,
using the example in Chunk 45. We initiated the whole thing from a
dplyr pipeline that we used to subset the maize part of
crop_ylds, which we piped to ggplot, which
(according to ?ggplot) “initializes a ggplot object”. It is
empty in this case because it is receiving the piped input; otherwise we
would need to give it a data.frame (note here that
ggplot only works with data.frames, unlike
plot, which can work with vectors).
We then pass that to geom_point, using +,
which is somewhat analogous to dplyr’s %>%,
but in this case is used to build up ggplot objects.
geom_point is the geometry function for creating
scatterplots. There are many different geom_* functions in
ggplot2, one for each kind of plot. We’ll see more of those
later.
Inside geom_point we use the function aes
to satisfy the “mapping” argument. It stands for “aesthetic mapping”,
which “…describe how variables in the data are mapped to visual
properties (aesthetics) of geoms.” (from ?aes).
xlab, ylab, and ggtitle should be
fairly self-explanatory.
An alternate way of constructing the same plot is this:
# Chunk 46
ggplot(data = crop_ylds %>% filter(crop == "maize"),
mapping = aes(x = harv_area, y = prod)) +
geom_point(col = "blue") + xlab("Harvested area (ha)") +
ylab("Production (tons)") + ggtitle("Maize production vs harvested area")That is not run, to save space, but is identical. In this case,
insert the tibble directly into ggplot’s
“data” argument, and the aes is supplied to the “mapping”
argument. We still need to add the geom_point layer,
but now we just tell it what point color we want (“col” argument). The rest
is the same.
So the above examples just show ggplot with
prod versus harv across both countries. How would we
distinguish between the countries, as we did in Chunks 43 and 44? By
making just two small changes to the code we used in Chunk 45 (same
would apply to Chunk 46)
# Chunk 47
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = harv_area, y = prod, color = country)) +
xlab("Harvested area (ha)") + ylab("Production (tons)") +
ggtitle("Maize production vs harvested area")All we did here was supply the “color” argument inside
aes, which in this case is the country variable.
This means that geom_point (note we drop the argument name
for “mapping” now, which is fine) will assign a different color to each
group of points based on their unique values in the country
variable. It then very conveniently 1) creates a legend, and 2)
automatically scales the axes to cover the full data ranges, and, of
course, 3) assigns a different color to each point grouping. The other
change we made was to remove the “col” argument that was originally
outside aes in Chunk 45, which assigned a single color to
all points. That’s a key point to remember: “color” (or “colour”–you
have probably already noticed that some tidyverse functions have
variants for both UK and US English spellings) is supplied inside
aes to create differentiated colors, while col
is used outside, and will map a single color.
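The contrast can be seen side by side in a small sketch using the built-in mtcars data (assumes ggplot2 is installed; the plots are not shown here):

```r
library(ggplot2)

# "color" inside aes(): mapped to a variable, so each group gets its
# own color and a legend is drawn automatically
p1 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl)))

# "col" outside aes(): a single fixed color for all points, no legend
p2 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), col = "blue")
```

Printing p1 draws the grouped, legended plot; printing p2 draws a single-color plot with no legend.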
I don’t much care for ggplot’s default color palette, so I
like to change it:
# Chunk 48
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = harv_area, y = prod, color = country)) +
scale_color_manual(values = c("red", "blue")) +
xlab("Harvested area (ha)") + ylab("Production (tons)") +
ggtitle("Maize production vs harvested area")Which in this case is done using scale_color_manual and
the vector of color names pssed to “values”.
4.2.2 Line plots
Right, those are some very basic scatter plots we just looked at, and what they are showing is that maize production and harvested area have differing relationships between countries. Let’s explore this some more, and acquire some new plotting skills in the process.
Remembering that yield is production/harvested area (tons/ha), and that the data we are examining are a time series, we might get a better sense of the data by plotting yield over time:
# Chunk 49
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
scale_color_manual(values = c("red", "blue")) +
xlab("Year") + ylab("Yield (tons / ha)") +
ggtitle("Maize yield (1961-2024)")Interesting. It seems that both countries show an increasing trend in yield since 1961, with a more rapid increase after 2000, particularly in Zambia. But the point are noisy, so maybe visualizing these as lines would make things a clearer?
# Chunk 50
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_line(mapping = aes(x = year, y = yield, color = country)) +
scale_color_manual(values = c("red", "blue")) +
xlab("Year") + ylab("Yield (tons / ha)") +
ggtitle("Maize yields (1961-2024)")That does make things a bit clearer–and it suggests that Zambian
yields diverged from Tanzanian yields in about the 2000s. There was a
huge drop in maize yield in 2000s in Tanzania. The lines also better
illustrate the between-year variations in yields, which seem to be
larger in Zambia than Tanzania. All we had to do to make the change was
swap out geom_point for geom_line.
For comparison, here is how we would do that same plot with
graphics methods:
# Chunk 51
crop_ylds %>% filter(crop == "maize" & country == "ZMB") %>%
plot(yield ~ year, data = ., type = "l", col = "blue",
ylim = range(crop_ylds %>% filter(crop == "maize") %>% pull(yield)),
main = "Maize yield (1961-2024)")
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>%
lines(yield ~ year, data = ., col = "red")Pretty much the same set-up as Chunk 44, using the
type = "l" argument in plot to plot lines
instead of points for the Zambian maize yields, and then the
lines function (instead of points) to add the
Tanzanian yield time series.
4.2.3 Points and lines
Those line comparisons still only give us a visual comparison.
Fitting a trend line to the points would make things clearer. We can do
that quite easily with ggplot:
# Chunk 52
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
geom_smooth(aes(x = year, y = yield, color = country)) +
scale_color_manual(values = c("red", "blue")) +
xlab("Year") + ylab("Yield (tons / ha)") +
ggtitle("Maize yields (1961-2024)")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The above uses geom_smooth to add a best-fitting curve
through each country’s yield time series. This function defaults to
loess, which fits a local polynomial
regression to the points (see ?loess for more details). It
also provides a confidence interval around that fit. This now shows the
steeper yield increase in Zambia starting around 2000.
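For reference, the curve that geom_smooth() fits by default can be reproduced directly with stats::loess in base R. A small sketch on the built-in mtcars data:

```r
# Fit the same kind of local polynomial regression that
# geom_smooth(method = "loess") uses, then predict at one point
lo <- loess(mpg ~ wt, data = mtcars)
pred <- predict(lo, newdata = data.frame(wt = 3))
pred  # fitted mpg for a car weighing 3000 lbs
```

predict() over a fine grid of x values is how you would draw the smooth curve yourself with lines() in a graphics plot.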
We can also just examine the linear (rather than curvilinear) trends:
# Chunk 53
crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
geom_smooth(aes(x = year, y = yield, color = country), method = "lm") +
scale_color_manual(values = c("red", "blue")) +
xlab("Year") + ylab("Yield (tons / ha)") +
ggtitle("Maize yield (1961-2024)")
#> `geom_smooth()` using formula = 'y ~ x'
That gives us a straight-line fit through each time series, produced by
passing method = "lm" so that geom_smooth calls on the lm function to
fit the trendlines; Zambia again shows the steeper trend over the whole
time series.
With graphics::plot, we have to fit the trendlines more
explicitly:
# Chunk 54
zmb_maize <- crop_ylds %>% filter(crop == "maize" & country == "ZMB")
tza_maize <- crop_ylds %>% filter(crop == "maize" & country == "TZA")
yl <- range(crop_ylds[crop_ylds$crop == "maize", "yield"])
plot(yield ~ year, data = zmb_maize, pch = 16, col = "blue",
ylim = yl, main = "Maize yield (1961-2024)")
abline(lm(yield ~ year, data = zmb_maize), col = "blue")
points(yield ~ year, data = tza_maize, pch = 16, col = "red")
abline(lm(yield ~ year, data = tza_maize), col = "red")We recreate the two subset tibbles
zmb_maize and tza_maize, create a new variable
yl containing the range of yields across both datasets,
plot points for zmb_maize, and use lm to fit
the regression between yield and year in
zmb_maize, then wrap that in the abline
function, which converts the regression model to a prediction line
(i.e. the trendline). We then use points with
tza_maize to add the scatter plot for Tanzanian maize, and
fit another abline on the lm for that dataset.
That’s more involved than the ggplot approach.
4.3 One-dimensional plots
We have been looking at plots using two variables so far. Now let’s
look at some of the more commonly used visualizations for single
variables. Let’s start with a histogram of sorghum yields, starting
first with ggplot:
# Chunk 55
crop_ylds %>% filter(crop == "sorghum") %>%
ggplot() + geom_histogram(aes(x = yield), fill = "blue", col = "black") +
ggtitle("Distribution of sorghum yields")
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
We grab sorghum out of crop_ylds, and then use
geom_histogram with an aes mapping
yield to the x variable. We specify the color for the histogram
bar “fill” and the bar outline color (“col”) outside of
aes. The output suggests that we pick a better bin
width. Indeed, the bars are too narrow and there are data gaps. Let’s
try again with a different number of bins:
# Chunk 56
crop_ylds %>% filter(crop == "sorghum") %>%
ggplot() +
geom_histogram(aes(x = yield), bins = 15, fill = "blue", col = "black") +
ggtitle("Distribution of sorghum yields")We specify that we want 15 bins, rather than the default 30. Looks a
bit better. Here’s how do it using graphics::hist:
# Chunk 57
hist(x = crop_ylds$yield[crop_ylds$crop == "sorghum"], xlab = "yield",
main = "Distribution of sorghum yields", col = "blue")We’ll keep focusing on ggplot though. How about a
histogram breaking sorghum yields down by country?
# Chunk 58
crop_ylds %>% filter(crop == "sorghum") %>%
ggplot() +
geom_histogram(aes(x = yield, color = country), bins = 15) +
ggtitle("Distribution of sorghum yields")
As with scatter plots, specifying the “color” argument within
aes separates the bars by country, making stacked
histograms. What about changing the fill, rather than the bar outline
color?
# Chunk 59
crop_ylds %>% filter(crop == "sorghum") %>%
ggplot() +
geom_histogram(aes(x = yield, fill = country), bins = 15) +
ggtitle("Distribution of sorghum yields") +
scale_fill_manual(values = c("red", "blue"))
We specify fill = country in aes.
Okay, so now it is clear that for polygon-type geoms, we distinguish
between “fill” and “color” within aes (or “fill” and “col”
outside of it). And if we want to change the fill colors from
ggplot’s defaults, as we did with scatter plots above, we
use the scale_fill_manual function.
We could also make this a side-by-side histogram, rather than a stacked one:
# Chunk 60
crop_ylds %>% filter(crop == "sorghum") %>%
ggplot() +
geom_histogram(aes(x = yield, fill = country), bins = 15,
position = "dodge", col = "black") +
ggtitle("Distribution of sorghum yields") +
scale_fill_manual(values = c("red", "blue"))
We achieve that by using the argument position = "dodge"
(outside of aes). We also use col = "black"
outside of aes to add a black outline around each bar.
So those are histograms. How about a simpler barplot for counting a categorical variable?
# Chunk 62
crop_ylds %>% filter(crop == "wheat") %>%
mutate(yld_levels = ifelse(yield < 3, 1, 2)) %>%
mutate(yld_levels = ifelse(yield > 6, 3, yld_levels)) %>%
ggplot() + geom_bar(aes(x = yld_levels), fill = "blue", col = "black") +
xlab("Yield levels")
Here we use geom_bar to make a bar plot from the
categorical variable yld_levels, which we create using
mutate and ifelse to group wheat yields into 3
categories (yield < 3 t/ha; 3-6 t/ha; >6 t/ha). We make the bars
blue-filled and outlined in black.
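As an aside, the two chained ifelse calls above can be expressed in a single call with dplyr’s case_when, which this module does not otherwise use; here is a self-contained sketch with illustrative yield values:

```r
library(dplyr)

# Illustrative wheat yields (t/ha); not taken from crop_ylds
yields <- c(1.2, 4.5, 7.3, 2.9, 6.1)

# case_when evaluates conditions in order, so the middle category can be
# the fall-through (TRUE) case
yld_levels <- case_when(
  yields < 3 ~ 1,  # low: < 3 t/ha
  yields > 6 ~ 3,  # high: > 6 t/ha
  TRUE ~ 2         # middle: 3-6 t/ha
)
yld_levels
#> [1] 1 2 3 1 3
```

Inside a pipeline you would wrap this in a single mutate rather than two.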
The last plot we will show is the boxplot:
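The code for this chunk does not appear here; based on the description that follows, it was presumably along these lines (a sketch, assuming crop_ylds and the packages loaded earlier in the vignette):

```r
# Chunk 63 (reconstructed sketch): one boxplot of yield per crop
crop_ylds %>%
  ggplot() + geom_boxplot(aes(x = crop, y = yield)) +
  xlab(NULL)
```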
Pretty straightforward. We get a separate boxplot for each crop by
specifying crop as the “x” argument and yield as “y” in
aes within geom_boxplot. It is another way to
show the distribution of values within a variable. We can also easily
distinguish separate boxplots for each country:
# Chunk 64
crop_ylds %>%
ggplot() + geom_boxplot(aes(x = crop, y = yield, fill = country)) +
xlab(NULL) + scale_fill_manual(values = c("red", "blue"))
We simply make country the variable on which to “fill” in
aes, and then change the fill color to red and blue.
4.4 Multi-panel plots
Thinking back to our scatter plots, there is more to the story regarding the differing yield trends between Tanzania and Zambia, and it requires multiple plots to really tell it. Let’s have a look at yield, production, and harvested area trends over time to see what’s going on:
# Chunk 65
# #1
p1 <- crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
scale_color_manual(values = c("red", "blue")) +
ylab("Yield (tons/ha)") + xlab("") +
geom_smooth(aes(x = year, y = yield, color = country))
p2 <- crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = prod, color = country)) +
scale_color_manual(values = c("red", "blue")) +
ylab("Production (tons)") + xlab("") +
geom_smooth(aes(x = year, y = prod, color = country))
p3 <- crop_ylds %>% filter(crop == "maize") %>%
ggplot() + geom_point(aes(x = year, y = harv_area, color = country)) +
scale_color_manual(values = c("red", "blue")) +
ylab("Harvested area (ha)") + xlab("") +
geom_smooth(aes(x = year, y = harv_area, color = country))
# #2
gp <- cowplot::plot_grid(p1 + theme(legend.position = "none"),
p2 + theme(legend.position = "none"),
p3 + theme(legend.position = "none"), nrow = 1,
align = "vh", axis = "l")
# #3
gp2 <- cowplot::plot_grid(gp, cowplot::get_legend(p1), rel_widths = c(3, 0.3))
# #4
# REPLACE THE filename WITH YOUR OWN!!!!
ggsave(gp2, filename = "materials/fig/u1m4-1.png", width = 9, height = 2.5,
units = "in", dpi = 300)
This is a bit more advanced, but worth having a look at. In #1, we
replicate our code from Chunk 52, using aes() with the
“color” argument to distinguish the points by country. We create
one ggplot object for each variable yield,
prod, and harv_area. We assign each plot to its own
object, which prevents it from printing to the graphics device.
In #2, we use the function plot_grid from the
cowplot package to arrange the three plots into a single
plot. We specify that the plots will fall onto one row
(nrow = 1), and that it should align the three plots
vertically and horizontally (align = "vh") against the left
axis (axis = "l"). We also add
theme(legend.position = "none") to each plot object, which
tells it to render without its legend. That is done to prevent
three identical legends being created, one for each plot, which would
take up valuable plot real estate.
In #3, we use cowplot::plot_grid again, putting the
three panel plot object as the first argument, and then use a function
to grab the legend from one of the individual plots
(cowplot::get_legend(p1)), which will make it appear to the
right of the three scatter plots. rel_widths = c(3, 0.3)
specifies the relative width of each plot in the row, so the legend will
take <10% of the total plot space (0.3 / (3 + 0.3)).
In #4 we use ggsave to write the whole plot to a png,
which we then read back in to display within the vignette (see the Rmd
code). The reason for that is to avoid making sdadata
dependent on cowplot to build vignettes. So to run this
code you will have to run install.packages("cowplot").
So what is the story told by this plot? While both countries began with similar agricultural profiles in the early 1960s, their paths diverged significantly over time. Tanzania experienced a massive expansion in harvested area that fueled a corresponding surge in total production, whereas Zambia’s harvested area remained relatively stable and small-scale. Despite this difference in land expansion, Zambia achieved much higher yield gains, producing more tons per hectare than Tanzania by the 2020s. Ultimately, Tanzania’s production growth was driven primarily by land expansion with modest yield improvements, while Zambia focused on intensifying productivity on a much smaller land footprint.
We just showed a more complicated way of making a multi-panel plot.
There is a simpler way to do this with ggplot2 functions,
provided that the variables on your two axes don’t change between plot
panels. We can illustrate this by showing the yield time series
for each crop:
# Chunk 66
crop_ylds %>%
ggplot(.) + geom_point(aes(x = year, y = yield, color = country)) +
geom_smooth(aes(x = year, y = yield, color = country)) +
facet_grid(cols = vars(crop)) +
scale_color_manual(values = c("red", "blue")) +
ylab("Yield (tons/ha)") + xlab("")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In this example, we use ggplot2::facet_grid to split the
output into three plots, one per crop type, specifying that we want the
values in vars(crop) to split into columns. For
facet_grid, you have to wrap the column name in
vars().
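A closely related alternative, not used in this module’s chunks, is facet_wrap, which takes a one-sided formula rather than vars() and wraps panels across rows as needed; a sketch using the same data:

```r
# Same plot as above, but with facet_wrap; with only three crops on one
# row the result looks much like facet_grid(cols = vars(crop))
crop_ylds %>%
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
  geom_smooth(aes(x = year, y = yield, color = country)) +
  facet_wrap(~crop) +
  scale_color_manual(values = c("red", "blue")) +
  ylab("Yield (tons/ha)") + xlab("")
```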
We can also specify that by rows:
# Chunk 67
crop_ylds %>%
ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
geom_smooth(aes(x = year, y = yield, color = country)) +
facet_grid(rows = vars(crop)) +
scale_color_manual(values = c("red", "blue")) +
ylab("Yield (tons/ha)") + xlab("")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
So that’s it for our visualization section. There are many other graphics we could make, and we will learn some of these as we continue into the spatial unit of the course.
4.5 Practice
4.5.1 Questions
- What are some of the primary differences between plots made using the
graphics package and those made with ggplot2?
- ggplot2 seems much prettier and maybe more convenient. So why are we
learning about base graphics plots as well?
- How do the axis labels in the two examples given in Chunk 41 differ
from the plot in Chunk 40? How could you change the names of the labels?
- You can use graphics::plot with dplyr pipelines. In order for it to
work, what argument and argument value do you have to use?
- How do you change the point color, point style, and point size in a
graphics::plot?
- What happens when you run ggplot without a geom_* argument? You can
run the code in Chunk 45 with the geom_point part removed to check.
4.5.2 Code
- Run the code in Chunk 56 without the “fill” and “col” arguments. Then
run it with just fill = "red" (no “col” argument).
- Using crop_ylds, split out the Tanzania wheat data (with dplyr) and
use plot to make a time series between year and harv_area, using first
points and then lines (hint: use the type = "l" argument for lines).
Make the points solid blue and the lines blue. Give a title “Tanzania
Wheat (1961-2024)”, and make the axis labels empty for the year axis and
“Harvested area (ha)” for the y axis.
- Recreate the same two plots (labels, title, colors) using ggplot’s
geom_point and then geom_line.
- Building on the previous, create a ggplot that separates the harv_area
trends by country. Use geom_line to show the two trends. Change the
default ggplot colors to blue for Zambia and red for Tanzania (i.e. use
scale_color_manual). Extra: You will see that the Tanzania harv_area
values are large relative to Zambia’s, so it is hard to see how they
change over time. Wrap the log10() function around harv_area in aes, and
replot for a more informative visualization.
- Plot just the Tanzania wheat time series, but use geom_point and
geom_smooth to plot the points with a smooth trend through them. In
other words, use the code you used for the first part of code practice 3
above, but add geom_smooth.
- Create a histogram from Zambia’s wheat planted area. Do it using both
ggplot2 and graphics::hist methods. Add titles and informative x axis
labels to both. Make the bars blue with black outlines.
- Let’s display wheat, sorghum, and maize harvested areas for Tanzania
on side-by-side scatter plots, each with a smooth trendline fit through
it. That requires adapting Chunk 66’s code. Hints: leave out
color = country in geom_point and geom_smooth, and use dplyr to filter
for Tanzanian data.
5 Unit assignment
5.1 Set-up
Make sure you are working in the master branch of your project (you should not be working on the a2 or a3 branch). Create a new vignette named “module4.Rmd”. You will use this to document the tasks you undertake for this assignment.
There are no package functions to create for this assignment. All work is to be done in the vignette.
5.2 Vignette tasks
- Create three tibbles, t1, t2, t3. t1 has 10 rows, with column V1
containing values G1 through G10, and V2 containing runif values between
75 and 125. t2 has 15 rows with v1 (G1 - G15) and v3 containing a random
selection of LETTERS A-F. t3 has 20 rows (v1 with G1-G20), and v4 with
numbers from the random normal distribution (mean = 100, sd = 20). Use a
seed of 1 for random numbers. Join t1, t2, and t3 within a single
pipeline, using:
  - left_join
  - right_join
  - full_join
  - inner_join
- Recreate the crop_ylds dataset, using 1) an lapply to read in each
.csv file from the package extdata/ folder, and 2) the dplyr steps
necessary to *_join the data and make the necessary mutate-ations.
Chunks 1, 11, 16, 19, and 21 are your guides.
- Use dplyr verbs to select the 5 top-ranked years for total harvested
area for Tanzanian maize. Do the same for Tanzanian maize yields. To do
this, you will need to use filter, arrange, and slice. The outputs for
each test should be the 5 rows of crop_ylds that meet these criteria.
- Calculate the mean of each crop’s yield (across both countries) using
SAC based on dplyr, as well as an sapply using base R syntax within the
sapply to subset on crop (note, subsetting a tibble is a bit different,
so use this syntax to do the job within the sapply:
mean(crop_ylds[crop_ylds$crop == x, ]$yield)).
- Calculate a correlation matrix between harv_area and yield for each
crop-country combination, using dplyr verbs. Arrange the result
(negative to positive) by the value of the correlation coefficient. See
Chunk 34 for guidance.
- Create a single scatter plot with ggplot that shows the relationship
between harv_area (x-axis) and yield (y-axis) for maize, separated by
country on a single plot. Make it a point scatter plot, with a straight
trendline fit through each set of points (i.e. method = "lm"). You will
need to use geom_point and geom_smooth. Make a title (“Harvested area
versus yield”) and x (“Harvested area (ha)”) and y (“Yield (t/ha)”)
labels.
- Create a single scatter plot with graphics::plot that plots just
Tanzanian wheat yields (y-axis) against year (x-axis). Plot the points,
and then add a linear trendline to it, by wrapping abline around the lm
function. Make the points solid grey (“grey”) and the abline blue. Label
the y axis as “Yield (t/ha)”. Remove the x-axis label. Give a title:
“Tanzanian wheat (1961-2024)”. Chunk 54 is your guide.
- Use ggplot to make a 5-bin histogram of Zambia’s maize yields. The
x-label should be “Yield (t/ha)”, the title should be “Zambian Maize”,
and bins should be blue with black outlines.
5.3 Assignment output
Refer to Unit 1 Module 3’s Assignment output section for submission instructions. The differences are as follows:
Your submission should be on a new side branch “a3”;
You should increment your package version number by 1 on the lowest number (e.g. from 0.0.1 to 0.0.2) in the DESCRIPTION;
Control the width of your plots in 6-8 using the following chunk options:
For all chunks, suppress messages and warnings by adding
message = FALSE and warning = FALSE to the chunk options, e.g.
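The exact option values for this assignment are not reproduced here, but a knitr chunk header that suppresses messages and warnings while controlling figure size typically looks like the following (the dimensions are illustrative, not the assignment’s required values):

```r
# A knitr chunk header of this form goes at the top of each code chunk
# in module4.Rmd (shown as a comment here because chunk headers are not
# standalone R code):
#
# {r, message = FALSE, warning = FALSE, fig.width = 6, fig.height = 4}
```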